The insurance dataset is about collection of metrics describing the profile of the person who is insured and it some other information regarding to insurance charges and region. We could see correlation plot to take a look at variables and its relations.
age and charges are showing moderate positive relationship with Cor = 0.299 meaning that older individuals tend to have higher insurance charges.
There is a weak positive relationship between bmi vs chargesCor = 0.198 showing us that as bmi increase, insurance charges tend to increase.
children and charges showing very weak relationship Cor = 0.068, number of children have a minimal positive impact on charges. We could investigate their relationship using different visualization plots.
charges variable showing right skew distribution. There might be potential outliers in our dataset.
Since smoker is categorical variable and it doesn’t show any correlation number, however boxplot showing smoking status likely drives a major portion of insurance charges. Next we will investigate smoker status affect on insurance charges.
library(ggplot2)ggplot(data = insurance, mapping =aes(y = charges, x = smoker, fill = smoker)) +geom_boxplot()
Boxplot shows us that Non-smokers usually pay less than $10,000, while most smokers pay between $20,000 and $40,000. Even the cheapest medical bills for smokers are typically higher than the average bills for non-smokers.
library(dplyr)library(scales) insurance_summary <- insurance %>%group_by(children) %>%summarize(mean_charges =mean(charges))ggplot(insurance_summary, aes(x = children, y = mean_charges, fill =as.factor(children))) +geom_col() +geom_text(aes(label =dollar(mean_charges)), angle =90, color ="white", position =position_stack(vjust =0.5)) +scale_y_continuous(labels = comma) +scale_x_continuous(breaks =seq(0, max(insurance_summary$children), by =1)) +labs(title ="Average Charges by Number of Children",x ="Number of Children", y ="Mean Charges",fill ="Children") +theme_minimal()
Bar chart shows how the number of children relates to average medical charges. Costs slowly increase as the number of children goes from zero to three, peaking at a high of $15,355.32 for families with three children. However, after that point, the charges drop significantly, with families of five children having the lowest average cost at $8,786.04.
ggplot(data = insurance, mapping =aes(y = charges, x = region, color = region)) +geom_boxplot()
While the median costs remain relatively consistent across all four regions, the distribution of expenses varies, with the southeast region showing the widest range of charges. Every region contains a significant number of high cost outliers, showing us that there are some individuals whose insurance charges far more than typical regional average.
Fitting Linear Regression Model and Interpretations of outputs
Call:
lm(formula = charges ~ ., data = insurance)
Residuals:
Min 1Q Median 3Q Max
-11304.9 -2848.1 -982.1 1393.9 29992.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
age 256.9 11.9 21.587 < 2e-16 ***
sexmale -131.3 332.9 -0.394 0.693348
bmi 339.2 28.6 11.860 < 2e-16 ***
children 475.5 137.8 3.451 0.000577 ***
smokeryes 23848.5 413.1 57.723 < 2e-16 ***
regionnorthwest -353.0 476.3 -0.741 0.458769
regionsoutheast -1035.0 478.7 -2.162 0.030782 *
regionsouthwest -960.0 477.9 -2.009 0.044765 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
Multiple R-squared = 0.7509 means that our model explains 75% of the variation in insurance charges.
Adjusted R-squared = 0.7494 means that after adjusting for the number of predictors, the model still explains 75% of the variation.
The coefficient for age is 256.9, meaning that holding sex, BMI, children, smoker status, and region fixed, a 1 year increase in age is associated with an increase, in predicted charges of about $256.90. The p-value < 2e-16 is highly significant, suggesting that age has a strong independent positive relationship with medical charges.
The coefficient for sex (male) is -131.3, meaning that holding all other variables fixed, males have predicted charges about $131 lower than females. However, the p-value = 0.693 is not statistically significant, meaning sex does not have a meaningful independent effect in this model.
The coefficient for BMI is 339.2, meaning that holding all other variables fixed, a 1-unit increase in BMI is associated with an increase in predicted charges of about $339.20. The p-value < 2e-16 is highly significant, suggesting BMI has a strong independent positive effect on medical charges.
The coefficient for children is 475.5, meaning that holding other variables fixed, each additional child is associated with an increase in predicted charges of about $475.50. The p-value = 0.000577 is statistically significant, indicating children has a meaningful positive effect on charges.
The coefficient for smoker (yes) is 23848.5, meaning that holding all other variables fixed, smokers have predicted charges about $23,848 higher than non-smokers. The p-value < 2e-16 is extremely significant, suggesting smoking status is the strongest predictor in the model. This is a very large magnitude effect relative to the average charge (~$13,270)
In terms of regional price indicators relative to the Northeast, both the Southeast and Southwest show statistically significant differences, with charges being approximately $1,035 lower(p = 0.031) and $960 lower(p = 0.045), respectively. In contrast, the Northwest shows a decrease of $353.0, which is not statistically significant (p = 0.459), indicating no meaningful difference in charges compared to the Northeast.
T-Test for our model and Fitting Reduced Model
The individual t-tests indicate that age, bmi, children, smoker, and two regional indicators significantly predict insurance charges, while sex and the northwest region do not have statistically significant independent effects on insurance charges.
reduced_model <-lm(charges ~ age + bmi + children + smoker + region, data = insurance)summary(reduced_model)
Call:
lm(formula = charges ~ age + bmi + children + smoker + region,
data = insurance)
Residuals:
Min 1Q Median 3Q Max
-11367.2 -2835.4 -979.7 1361.9 29935.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
age 256.97 11.89 21.610 < 2e-16 ***
bmi 338.66 28.56 11.858 < 2e-16 ***
children 474.57 137.74 3.445 0.000588 ***
smokeryes 23836.30 411.86 57.875 < 2e-16 ***
regionnorthwest -352.18 476.12 -0.740 0.459618
regionsoutheast -1034.36 478.54 -2.162 0.030834 *
regionsouthwest -959.37 477.78 -2.008 0.044846 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6060 on 1330 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
anova(full_model, reduced_model)
Analysis of Variance Table
Model 1: charges ~ age + sex + bmi + children + smoker + region
Model 2: charges ~ age + bmi + children + smoker + region
Res.Df RSS Df Sum of Sq F Pr(>F)
1 1329 4.8840e+10
2 1330 4.8845e+10 -1 -5716429 0.1556 0.6933
An F-test comparing the full model to our reduced model excluding sex yielded F = 0.156 with p = 0.693. Since the p-value is much greater than 0.05, we fail to reject the null hypothesis that the coefficient for sex is zero. Removing sex does not significantly reduce model fit. Therefore, sex does not provide meaningful explanatory power and is excluded from the final model.
library(car)
Warning: package 'car' was built under R version 4.5.2
Loading required package: carData
Warning: package 'carData' was built under R version 4.5.2
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
vif(reduced_model)
GVIF Df GVIF^(1/(2*Df))
age 1.016188 1 1.008061
bmi 1.104197 1 1.050808
children 1.003714 1 1.001855
smoker 1.006369 1 1.003179
region 1.098870 3 1.015838
The Variance Inflation Factor (VIF) results indicate no evidence of multicollinearity in our reduced model
Model diagnostics
par(mfrow =c(2,2))plot(reduced_model)
The diagnostic plots suggest some violations of classical linear regression assumptions. The Residuals vs Fitted plot shows a slight curved pattern and increasing spread at higher fitted values, indicating potential non-linearity and heteroscedasticity.
The Scale-Location plot further confirms non-constant variance, as residual variability increases with predicted charges.
The Q-Q plot shows noticeable deviation in the upper tail, suggesting that residuals are not perfectly normally distributed, likely due to extreme high-cost observations.
The Residuals vs Leverage plot does not show any major influential points beyond Cook’s distance, although a few observations have slightly higher leverage. Overall, multicollinearity is not an issue in this model, but there is some heteroscedasticity and slight deviation from normality.
Model Remedy
At this point I will apply log transformation to charges in our reduced model that could help improving the model assumptions and make the results more reliable.
log_model <-lm(log(charges) ~ age + bmi + children + smoker + region, data = insurance)summary(log_model)
Call:
lm(formula = log(charges) ~ age + bmi + children + smoker + region,
data = insurance)
Residuals:
Min 1Q Median 3Q Max
-1.10302 -0.19707 -0.05206 0.06564 2.15091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.0008478 0.0719853 97.254 < 2e-16 ***
age 0.0346490 0.0008746 39.618 < 2e-16 ***
bmi 0.0130711 0.0021004 6.223 6.52e-10 ***
children 0.1013204 0.0101304 10.002 < 2e-16 ***
smokeryes 1.5472965 0.0302910 51.081 < 2e-16 ***
regionnorthwest -0.0633386 0.0350174 -1.809 0.070712 .
regionsoutheast -0.1568166 0.0351952 -4.456 9.07e-06 ***
regionsouthwest -0.1285638 0.0351393 -3.659 0.000263 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4457 on 1330 degrees of freedom
Multiple R-squared: 0.7663, Adjusted R-squared: 0.765
F-statistic: 622.9 on 7 and 1330 DF, p-value: < 2.2e-16
Multiple R-squared = 0.7663 means that the model explains about 76.6% of the variation in log(charges). Adjusted R-squared = 0.765 means that after adjusting for predictors, the model still explains about 76.5% of the variation, which shows a strong overall fit and slightly better than the our previous reduced model.
par(mfrow =c(2,2))plot(log_model)
The red line is not flat, it shows a noticeable downward curve as fitted values increase. There is also a distinct “split” in the residuals left part of the plot. This suggest that there is non-linear relationship that our model is not capturing.
Q-Q residuals plot is showing heavy tail towards the right, meaning there are more extreme “outlier” errors than a normal distribution expects.
The Scale-Location plot shows decrease and increase aftterwards. This indicates heteroscedasticity, meaning the model’s prediction error isn’t constant across all data ranges.
Residuals vs Leverage plot does not indicate any major influential observations beyond Cook’s distance. While there are some high leverage points, none seem to be single handedly affecting the model, however they should be investigated closely.
plot(log_model, which =4)
The Cook’s Distance plot shows a few observations with moderately higher influence (220, 431 and 1028), but all values remain well below 1. Although some exceed the 4/(1338 number of observations)≈0.003 guideline, none appear large enough to be considered highly influential. Therefore, no single observation is unduly driving the regression results, and the model appears stable.
cooksD <-cooks.distance(log_model)inf_points <-which(cooksD >0.003) # I utilized this threshold since it is close to 4/n=1338 length(inf_points)
Original Without_Inf
(Intercept) 7.00084777 6.780727831
age 0.03464897 0.040752119
bmi 0.01307111 0.009396141
children 0.10132039 0.111972050
smokeryes 1.54729653 1.583431473
regionnorthwest -0.06333857 -0.091761940
regionsoutheast -0.15681659 -0.150078978
regionsouthwest -0.12856380 -0.119116357
age is showin 18% increase from 0.0346 to 0.0408 this is a moderate change but still positive and significant.
bmi is showing -28% decrease but still statistically significant, suggesting that some influential observations were affecting bmi estmiated affect.
par(mfrow =c(2,2))plot(log_model_reduced)
Our diagnostic plots are not showing expected improvement, also taking out 105 observation points might not be the best way to approach predicitng charges as they cover 8% of our dataset. Therefore we are going to look for alternative option for model remedy.
Remedy 2
At this point we have to consider posibble interactions we could utilize to improve predicting power of our model. In our previous discoveries we have noticed that smoker and bmi has strong affect on insurance charges. Higher bmi and being smoker could have an ultimate impact on insurance charges, therefore combining those variable can give us better prediction without making changes to our dataset.
interaction_model <-lm(log(charges) ~ age + children + region + bmi*smoker,data = insurance)summary(interaction_model)
Call:
lm(formula = log(charges) ~ age + children + region + bmi * smoker,
data = insurance)
Residuals:
Min 1Q Median 3Q Max
-0.95236 -0.17362 -0.04363 0.04653 2.18043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.2973126 0.0762591 95.691 < 2e-16 ***
age 0.0348687 0.0008466 41.185 < 2e-16 ***
children 0.1025063 0.0098040 10.456 < 2e-16 ***
regionnorthwest -0.0704737 0.0338945 -2.079 0.0378 *
regionsoutheast -0.1621837 0.0340629 -4.761 2.13e-06 ***
regionsouthwest -0.1369021 0.0340154 -4.025 6.03e-05 ***
bmi 0.0032462 0.0022779 1.425 0.1544
smokeryes 0.1749616 0.1466030 1.193 0.2329
bmi:smokeryes 0.0447061 0.0046794 9.554 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4313 on 1329 degrees of freedom
Multiple R-squared: 0.7813, Adjusted R-squared: 0.78
F-statistic: 593.5 on 8 and 1329 DF, p-value: < 2.2e-16
par(mfrow =c(2,2))plot(interaction_model)
The interaction between BMI and smoking status significantly improves the model and changes how BMI and smoking should be interpreted. Rather than having separate independent effects, smoking amplifies the impact of BMI on medical costs. The interaction model provides a better statistical fit and a more realistic explanation of variation in insurance charges.
Model R2 Adj_R2 Residual_SE
Full Full 0.7509130 0.7494136 6062.1022885
Reduced Reduced 0.7508839 0.7495727 6060.1775001
Log Log 0.7662799 0.7650497 0.4457101
Log_Reduced Log_Reduced 0.9084246 0.9079013 0.2718137
Interaction Interaction 0.7813001 0.7799836 0.4313125
eventhough log_reduced has better prediction power, I would like to go with Interaction Model because , we can keep influential observations without sacrificing accuracy. The interaction terms allow the model to account for complex relationships that simple models ignore, resulting in a significantly better fit and lower prediction error.
log_pred <-predict(interaction_model, newdata = new_obs)final_prediction <-exp(log_pred) #this would help me to communicate my modelprint(final_prediction)
1 2
3538.986 38102.805
Observation 1 ($3,538.99): Age 25 with no children, non-smoker, 22.5 bmi and living in Northwest. This person is predicted to have relatively low insurance costs based on my model.
Observation 2 ($38,102.81): Age 45 having 2 kids, bmi 30, smoker and living in Southwest region. This person is predicted to cost nearly 11 times more than the first person.