To realize their profit, insurance companies must charge a higher premium than the amount paid to the insured. For this reason, insurance companies invest a lot of time, effort, and money in creating models that are able to accurately predict health care costs. In order to fulfill this mission, we will first analyze the factors that influence medical loads and secondly try to build an adequate model and optimize its performance.
insurances <- read.csv("insurance.csv")library(ggthemes)
library(tidyverse)
library(ggridges)
library(magrittr)
library(gridExtra)
library(patchwork)
library(tidymodels)
library(car)
library(caret)
library(MLmetrics)
library(lmtest)
theme_set(theme_minimal())glimpse(insurances)## Rows: 1,338
## Columns: 7
## $ age <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
## $ sex <chr> "female", "male", "male", "male", "male", "female", "female",~
## $ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
## $ region <chr> "southwest", "southeast", "southeast", "northwest", "northwes~
## $ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
We have a dataset that includes 1338 observations on 7 variables
We will convert the sex, children, region, smoker variables to the type factor to make it categorical:
insurances$sex %<>% as.factor()
insurances$children %<>% as.factor()
insurances$region %<>% as.factor()
insurances$smoker %<>% as.factor()summary(insurances)## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 0:574 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1:324 yes: 274
## Median :39.00 Median :30.40 2:240
## Mean :39.21 Mean :30.66 3:157
## 3rd Qu.:51.00 3rd Qu.:34.69 4: 25
## Max. :64.00 Max. :53.13 5: 18
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
as we know that bmi can be categoize into under weight, normal, overweight, and obese class 1 to 3
insurances %<>% mutate(bmi_cat = cut(bmi,
breaks = c(0, 18.5, 25, 30, 35, 40, max(insurances$bmi)),
labels = c("Under Weight", "Normal Weight", "Overweight", "Obese 1", "Obese 2", "Obese 3")
))Observations :
* 67% of individuals have at most 1 child, 85% have at most 2. * most of the respondents have overweight to obese in BMI category. * most of the respondents are smokers (80%). * the dataset is balanced with respect to sex and region.
library(inspectdf)
show_plot(inspect_cat(insurances))Note: Individuals with more than 3 children and those underweight are negligible in our sample so we will not take them into account for the comparison.
## age sex bmi children smoker region charges bmi_cat
## 0 0 0 0 0 0 0 0
the dataset contains no missing values! so we are ready to continue to EDA process.
The variables most correlated with the charges are “smoker”, “age” and “bmi” respectively.
insurances %>%
dplyr::select(age, bmi, smoker, charges) %>%
GGally::ggpairs(
lower = list(
continuous = GGally::wrap("points", col = "navy",alpha=0.8),
combo = GGally::wrap("box", fill = "white", col ="black")
),
upper = list(
continuous = GGally::wrap("cor", col = "navy"),
combo = GGally::wrap("facetdensity", col = "black")
),
diag = list(
continuous = GGally::wrap(
"barDiag",
fill = "maroon", col ="navy",
bins = 18
),
discrete = GGally::wrap("barDiag", fill = "maroon", col ="navy")
)
) +
theme_bw() Observations :
p1 <- ggplot(data = insurances, aes(x = charges)) +
geom_histogram(aes(y = ..density..),
col = "white",
bins = 25,
fill = "#90b8c2"
) +
geom_density() +
labs(title = "Distribution of charges") +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
)
p2 <- ggplot(data = insurances, aes(y = charges)) +
geom_boxplot(fill = "#90b8c2") +
ggtitle("Charges Boxplot") +
theme(
axis.ticks.x = element_blank(),
axis.text.x = element_blank()
) +
geom_hline(aes(yintercept = mean(charges)),
linetype = "dashed", col = "red")
grid.arrange(p1, p2, ncol = 2, widths = c(5, 2)) * The charges distribution is skewed to the right. * the outliers from this box plot cannot be determined well. So, the data must be transformed logarithmically to centralize the distribution.
p1 <- ggplot(data = insurances, aes(x = log(charges))) +
geom_histogram(aes(y = ..density..),
col = "white",
bins = 25,
fill = "#90b8c2"
) +
geom_density() +
labs(title = "Distribution of charges") +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
)
p2 <- ggplot(data = insurances, aes(y = log(charges))) +
geom_boxplot(fill = "#90b8c2") +
ggtitle("Charges Boxplot") +
theme(
axis.text.y = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank()
) +
geom_hline(aes(yintercept = mean(log(charges))),
linetype = "dashed", col = "red")
grid.arrange(p1, p2, ncol = 2, widths = c(5, 2)) It seems that there is no anomaly from the charges distribution.
p1 <- insurances %>%
ggplot(aes(x = age, y = charges, col = bmi_cat)) +
geom_point(alpha = 0.6, size = 2.5)
p2 <- insurances %>%
ggplot(aes(x = age, y = charges, col = smoker)) +
geom_point(alpha = 0.8,size = 2.5) +
scale_color_manual(values = c("#e09e8f", "#90b8c2")) +
geom_rug() +
geom_smooth() +
geom_smooth(
data = filter(insurances, smoker == "yes"),
col = "grey30",
method = lm,
se = FALSE
) +
geom_smooth(
data = filter(insurances, smoker == "no"),
col = "grey30",
method = lm,
se = FALSE
)
grid.arrange(p1, p2, nrow = 1) The charges are linked to age by an almost linear relationship at three levels:
We can also see that for the three levels, the older the customers the higher their charges.
We will use a simple linear model using the two variables age and health.
model_lm = lm(charges ~ ., data = insurances)
summary(model_lm)##
## Call:
## lm(formula = charges ~ ., data = insurances)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12527.2 -3462.3 -234.4 1668.1 28532.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7064.97 2304.33 -3.066 0.002214 **
## age 256.52 11.82 21.707 < 2e-16 ***
## sexmale -157.32 329.74 -0.477 0.633361
## bmi 83.83 104.98 0.798 0.424739
## children1 473.37 417.92 1.133 0.257551
## children2 1614.36 463.60 3.482 0.000513 ***
## children3 894.64 543.59 1.646 0.100043
## children4 3311.99 1231.27 2.690 0.007238 **
## children5 1446.70 1445.96 1.001 0.317247
## smokeryes 23849.89 410.27 58.132 < 2e-16 ***
## regionnorthwest -418.08 472.67 -0.884 0.376589
## regionsoutheast -892.94 475.96 -1.876 0.060861 .
## regionsouthwest -956.22 474.87 -2.014 0.044248 *
## bmi_catNormal Weight 838.33 1471.06 0.570 0.568855
## bmi_catOverweight 1212.05 1709.46 0.709 0.478434
## bmi_catObese 1 4237.21 2049.57 2.067 0.038895 *
## bmi_catObese 2 5111.24 2464.65 2.074 0.038290 *
## bmi_catObese 3 4507.06 3034.27 1.485 0.137680
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5999 on 1320 degrees of freedom
## Multiple R-squared: 0.7577, Adjusted R-squared: 0.7546
## F-statistic: 242.9 on 17 and 1320 DF, p-value: < 2.2e-16
as we can see that the adj. R-squared is 75.5%. it is a not bad model, but what can be done from the data to improve the model?
first we can try feature selection using stepwise regression
lm_no_predictors <- lm(formula = charges ~ 1, data = insurances)
model_lm_backward <- stats::step(object = model_lm, direction = "backward", trace = F)
model_lm_forward <- stats::step(object = lm_no_predictors,
scope = list(lower = lm_no_predictors,
upper = model_lm),
direction = "forward",
trace = F)
model_lm_bothStep <- stats::step(object = lm_no_predictors,
scope = list(lower = lm_no_predictors,
upper = model_lm),
direction = "both",
trace = F)
performance::compare_performance(model_lm_backward, model_lm_forward, model_lm_bothStep)summary(model_lm_backward)##
## Call:
## lm(formula = charges ~ age + children + smoker + bmi_cat, data = insurances)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12866.4 -3504.0 -246.6 1652.7 27993.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5983.10 1382.87 -4.327 1.63e-05 ***
## age 257.40 11.81 21.799 < 2e-16 ***
## children1 466.08 417.54 1.116 0.264507
## children2 1617.77 462.39 3.499 0.000483 ***
## children3 920.60 543.02 1.695 0.090247 .
## children4 3373.35 1229.64 2.743 0.006163 **
## children5 1291.95 1440.57 0.897 0.369971
## smokeryes 23819.51 408.08 58.369 < 2e-16 ***
## bmi_catNormal Weight 1069.70 1372.15 0.780 0.435777
## bmi_catOverweight 1788.95 1350.31 1.325 0.185452
## bmi_catObese 1 5180.10 1350.23 3.836 0.000131 ***
## bmi_catObese 2 6347.56 1376.34 4.612 4.38e-06 ***
## bmi_catObese 3 6209.18 1457.60 4.260 2.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6001 on 1325 degrees of freedom
## Multiple R-squared: 0.7567, Adjusted R-squared: 0.7545
## F-statistic: 343.4 on 12 and 1325 DF, p-value: < 2.2e-16
using stepwise regression seems to do nothing to improve the model.
Now, we will try to add new variable that combine BMI category and whether the patient is a smoker or not. because as we can see from the EDA above that BMI and smoker has high correlation with the charges. and we can also assume that smoking and high BMI could cause high cost diseases.
insurances %<>%
mutate(
health = case_when(
smoker == "yes" & str_detect(bmi_cat, "Obese 1", negate = F) ~ "Obese 1 and smoker",
smoker == "yes" & str_detect(bmi_cat, "Obese 2", negate = F) ~ "Obese 2 and smoker",
smoker == "yes" & str_detect(bmi_cat, "Obese 3", negate = F) ~ "Obese 3 and smoker",
smoker == "yes" & str_detect(bmi_cat, "Over", negate = F) ~ "Overweight and smoker",
smoker == "yes" & str_detect(bmi_cat, "Normal", negate = F) ~ "Normal wight and smoker",
smoker == "yes" & str_detect(bmi_cat, "Under", negate = F) ~ "Under wight and smoker",
smoker == "no" ~ "Non smoker"
)
)
insurances$health %<>% as.factor()model_lm_ext <- lm(formula = charges ~ age + health + children + region + sex,
data = insurances)
summary(model_lm_ext)##
## Call:
## lm(formula = charges ~ age + health + children + region + sex,
## data = insurances)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4077.1 -1812.7 -1224.6 -541.5 25032.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1770.33 457.67 -3.868 0.000115 ***
## age 265.49 8.63 30.763 < 2e-16 ***
## healthNormal wight and smoker 11683.07 641.88 18.201 < 2e-16 ***
## healthObese 1 and smoker 31830.16 536.27 59.355 < 2e-16 ***
## healthObese 2 and smoker 34390.72 644.26 53.380 < 2e-16 ***
## healthObese 3 and smoker 36923.53 985.45 37.469 < 2e-16 ***
## healthOverweight and smoker 14433.95 529.28 27.271 < 2e-16 ***
## healthUnder wight and smoker 12089.55 2006.00 6.027 2.17e-09 ***
## children1 441.38 307.27 1.436 0.151106
## children2 1310.89 342.31 3.830 0.000134 ***
## children3 1162.53 399.26 2.912 0.003655 **
## children4 3678.74 902.75 4.075 4.87e-05 ***
## children5 1493.52 1066.75 1.400 0.161731
## regionnorthwest -349.18 347.95 -1.004 0.315788
## regionsoutheast -783.84 341.38 -2.296 0.021828 *
## regionsouthwest -1128.38 348.29 -3.240 0.001226 **
## sexmale -505.60 242.97 -2.081 0.037636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4413 on 1321 degrees of freedom
## Multiple R-squared: 0.8688, Adjusted R-squared: 0.8672
## F-statistic: 546.8 on 16 and 1321 DF, p-value: < 2.2e-16
As we can see, now the model has improve quite well when we add new feature which is emphasize the health condition of patient based on the smoking and BMI variable.
On this first model: * we obtained an adjusted R-squared of 0.86, which means that 86% of the variation in charges could be explained by the two independent variables age and health. * it is also observed that all the independent variables are statistically significant predictors of medical costs (p.value less than 0.05 <- level of significance). * The average charge for an 18-year-old non-smoking insured is 2693.654. * One year increase in age increases the charges by 268,437 for the same category of health. * Compared to non-smokers, controlling for age: + Non-obese smokers have on average 13343.999 more charges. + Obese smokers have on average 33334.018 more loads.
hist(model_lm_ext$residuals)plot(density(model_lm_ext$residuals))Shapiro-Wilk hypothesis test: H0: error/residual is normally distributed H1: error/residual is not normally distributed
decision: * rejects H0 if p-value < alpha
shapiro.test(model_lm_ext$residuals)##
## Shapiro-Wilk normality test
##
## data: model_lm_ext$residuals
## W = 0.48685, p-value < 2.2e-16
Result: p-value is much lower than alpha (0.05), so we can assume that the residual is not normally distributed.
plot(model_lm_ext$fitted.values, model_lm_ext$residuals)
abline(h = 0, col = "red")H0: Error variances is spreading randomly (Homoscedasticity) H1: Error variances is not spreading randomly or make a pattern (Heteroscedasticity)
decision: * rejects H0 if p-value < alpha
library(lmtest)
bptest(model_lm_ext)##
## studentized Breusch-Pagan test
##
## data: model_lm_ext
## BP = 18.583, df = 16, p-value = 0.2909
Result: p-value is much bigger than alpha (0.2909 > 0.05), so we failed to reject null hypothesis and we can assume that the error is spreading randomly and constantly (Homoscedasticity).
To obtain even more precision in its predictions, this insurance company should collect more data about its customers in order to explain the behavior of some individuals that we have noticed - in the visualization phase - who do not follow the same trend as the others.