Introduction

To realize their profit, insurance companies must charge a higher premium than the amount paid to the insured. For this reason, insurance companies invest a lot of time, effort, and money in creating models that are able to accurately predict health care costs. In order to fulfill this mission, we will first analyze the factors that influence medical loads and secondly try to build an adequate model and optimize its performance.

Exploratory data analysis

Importing data

insurances <- read.csv("insurance.csv")

Importing the necessary tools

library(ggthemes)
library(tidyverse)
library(ggridges)
library(magrittr)
library(gridExtra)
library(patchwork)
library(tidymodels)
library(car)
library(caret)
library(MLmetrics)
library(lmtest)
theme_set(theme_minimal())

Overview of the data

glimpse(insurances)
## Rows: 1,338
## Columns: 7
## $ age      <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
## $ sex      <chr> "female", "male", "male", "male", "male", "female", "female",~
## $ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker   <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
## $ region   <chr> "southwest", "southeast", "southeast", "northwest", "northwes~
## $ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~

We have a dataset that includes 1338 observations on 7 variables

  • Age: the age of the insured (recipients).
  • Sex: sex of insured persons; “male” or “female”.
  • bmi: body mass index, providing an understanding of the body, relatively high or low weights relative to height, objective body weight index (kg / m ^ 2) using the height / weight ratio.
  • children: number of children covered by health insurance / number of dependents.
  • smoker: does the insured smoke or not.
  • region: the recipient’s residential area in the United States; northeast, southeast, southwest, northwest.
  • charges: Individual medical costs billed by health insurance.

We will convert the sex, children, region, smoker variables to the type factor to make it categorical:

insurances$sex %<>% as.factor()
insurances$children %<>% as.factor()
insurances$region %<>% as.factor()
insurances$smoker %<>% as.factor()

Description of variables

summary(insurances)
##       age            sex           bmi        children smoker    
##  Min.   :18.00   female:662   Min.   :15.96   0:574    no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1:324    yes: 274  
##  Median :39.00                Median :30.40   2:240              
##  Mean   :39.21                Mean   :30.66   3:157              
##  3rd Qu.:51.00                3rd Qu.:34.69   4: 25              
##  Max.   :64.00                Max.   :53.13   5: 18              
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

Categorize the body mass index (BMI)

as we know that bmi can be categoize into under weight, normal, overweight, and obese class 1 to 3

insurances %<>% mutate(bmi_cat = cut(bmi,
  breaks = c(0, 18.5, 25, 30, 35, 40, max(insurances$bmi)),
  labels = c("Under Weight", "Normal Weight", "Overweight", "Obese 1", "Obese 2", "Obese 3")
))
  • Under Weight: bmi<18.5
  • Normal Weight: 18.5<bmi<25
  • Overweight : 25<=bmi<30
  • Obese 1: 30<=bmi<35
  • Obese 2: 35<=bmi<40
  • Obese 3: bmi>=40

Description of categorical variables

Observations :
* 67% of individuals have at most 1 child, 85% have at most 2. * most of the respondents have overweight to obese in BMI category. * most of the respondents are smokers (80%). * the dataset is balanced with respect to sex and region.

Proportions of categories in categorical variables

library(inspectdf)
show_plot(inspect_cat(insurances))

Note: Individuals with more than 3 children and those underweight are negligible in our sample so we will not take them into account for the comparison.

Missing values

##      age      sex      bmi children   smoker   region  charges  bmi_cat 
##        0        0        0        0        0        0        0        0


the dataset contains no missing values! so we are ready to continue to EDA process.

Visualisation

Correlation matrix

The variables most correlated with the charges are “smoker”, “age” and “bmi” respectively.

Visualize the interactions between variables

insurances %>%
  dplyr::select(age, bmi, smoker, charges) %>%
  GGally::ggpairs(
    lower = list(
      continuous = GGally::wrap("points", col = "navy",alpha=0.8),
      combo = GGally::wrap("box", fill = "white", col ="black")
    ),
    upper = list(
      continuous = GGally::wrap("cor", col = "navy"),
      combo = GGally::wrap("facetdensity", col = "black")
    ),
    diag = list(
      continuous = GGally::wrap(
        "barDiag",
        fill = "maroon", col ="navy",
        bins = 18
      ),
      discrete = GGally::wrap("barDiag", fill = "maroon", col ="navy")
    )
  ) +
  theme_bw()

Observations :

  • The distributions of variables :
    • The age distribution of individuals is relatively the same, with a higher proportion on the age group 18 to 20 years old.
    • The distribution of BMI is apparently normally distributed centered around 30, which is Obese level 1.
    • The distribution of charges is skewed positively.
  • We notice an effect of these variables on the charges, which we will explore more in depth later.
  • No significant dependency between: age & bmi, smoker & bmi, and age & smoker.

Distribution of charges

p1 <- ggplot(data = insurances, aes(x = charges)) +
  geom_histogram(aes(y = ..density..),
    col = "white",
    bins = 25,
    fill = "#90b8c2"
  ) +
  geom_density() +
  labs(title = "Distribution of charges") +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )

p2 <- ggplot(data = insurances, aes(y = charges)) +
  geom_boxplot(fill = "#90b8c2") +
  ggtitle("Charges Boxplot") +
  theme(
    axis.ticks.x = element_blank(),
    axis.text.x = element_blank()
  ) +
  geom_hline(aes(yintercept = mean(charges)),
             linetype = "dashed", col = "red")

grid.arrange(p1, p2, ncol = 2, widths = c(5, 2))

* The charges distribution is skewed to the right. * the outliers from this box plot cannot be determined well. So, the data must be transformed logarithmically to centralize the distribution.

p1 <- ggplot(data = insurances, aes(x = log(charges))) +
  geom_histogram(aes(y = ..density..),
    col = "white",
    bins = 25,
    fill = "#90b8c2"
  ) +
  geom_density() +
  labs(title = "Distribution of charges") +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )

p2 <- ggplot(data = insurances, aes(y = log(charges))) +
  geom_boxplot(fill = "#90b8c2") +
  ggtitle("Charges Boxplot") +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.x = element_blank()
  ) +
  geom_hline(aes(yintercept = mean(log(charges))),
             linetype = "dashed", col = "red")

grid.arrange(p1, p2, ncol = 2, widths = c(5, 2))

It seems that there is no anomaly from the charges distribution.

Interactions between age, bmi and smoking and their impact on medical charges

p1 <- insurances %>%
  ggplot(aes(x = age, y = charges, col = bmi_cat)) +
  geom_point(alpha = 0.6, size = 2.5)

p2 <- insurances %>%
  ggplot(aes(x = age, y = charges, col = smoker)) +
  geom_point(alpha = 0.8,size = 2.5) +
  scale_color_manual(values = c("#e09e8f", "#90b8c2")) +
  geom_rug() +
  geom_smooth() +
  geom_smooth(
    data = filter(insurances, smoker == "yes"),
    col = "grey30",
    method = lm,
    se = FALSE
  ) +
  geom_smooth(
    data = filter(insurances, smoker == "no"),
    col = "grey30",
    method = lm,
    se = FALSE
  )

grid.arrange(p1, p2, nrow = 1)

The charges are linked to age by an almost linear relationship at three levels:

  • a first group which is characterized by the highest charges, it is completely made up of obese smoker individuals.
  • a second group which is characterized by the lowest charges, it consists entirely of non-smoking individuals and a normal bmi distribution.
  • and a third non-homogeneous group which requires more exploration.

We can also see that for the three levels, the older the customers the higher their charges.

Modelization

Linear model

Model initialization

We will use a simple linear model using the two variables age and health.

model_lm = lm(charges ~ ., data = insurances)
summary(model_lm)
## 
## Call:
## lm(formula = charges ~ ., data = insurances)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12527.2  -3462.3   -234.4   1668.1  28532.6 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -7064.97    2304.33  -3.066 0.002214 ** 
## age                    256.52      11.82  21.707  < 2e-16 ***
## sexmale               -157.32     329.74  -0.477 0.633361    
## bmi                     83.83     104.98   0.798 0.424739    
## children1              473.37     417.92   1.133 0.257551    
## children2             1614.36     463.60   3.482 0.000513 ***
## children3              894.64     543.59   1.646 0.100043    
## children4             3311.99    1231.27   2.690 0.007238 ** 
## children5             1446.70    1445.96   1.001 0.317247    
## smokeryes            23849.89     410.27  58.132  < 2e-16 ***
## regionnorthwest       -418.08     472.67  -0.884 0.376589    
## regionsoutheast       -892.94     475.96  -1.876 0.060861 .  
## regionsouthwest       -956.22     474.87  -2.014 0.044248 *  
## bmi_catNormal Weight   838.33    1471.06   0.570 0.568855    
## bmi_catOverweight     1212.05    1709.46   0.709 0.478434    
## bmi_catObese 1        4237.21    2049.57   2.067 0.038895 *  
## bmi_catObese 2        5111.24    2464.65   2.074 0.038290 *  
## bmi_catObese 3        4507.06    3034.27   1.485 0.137680    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5999 on 1320 degrees of freedom
## Multiple R-squared:  0.7577, Adjusted R-squared:  0.7546 
## F-statistic: 242.9 on 17 and 1320 DF,  p-value: < 2.2e-16

as we can see that the adj. R-squared is 75.5%. it is a not bad model, but what can be done from the data to improve the model?

first we can try feature selection using stepwise regression

Stepwise Regression (Feature Selection)

lm_no_predictors <- lm(formula = charges ~ 1, data = insurances)
model_lm_backward <- stats::step(object = model_lm, direction = "backward", trace = F)
model_lm_forward <- stats::step(object = lm_no_predictors,
                     scope = list(lower = lm_no_predictors,
                                  upper = model_lm),
                     direction = "forward", 
                     trace = F)
model_lm_bothStep <- stats::step(object = lm_no_predictors,
                     scope = list(lower = lm_no_predictors,
                                  upper = model_lm),
                     direction = "both", 
                     trace = F)

performance::compare_performance(model_lm_backward, model_lm_forward, model_lm_bothStep)
summary(model_lm_backward)
## 
## Call:
## lm(formula = charges ~ age + children + smoker + bmi_cat, data = insurances)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12866.4  -3504.0   -246.6   1652.7  27993.6 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5983.10    1382.87  -4.327 1.63e-05 ***
## age                    257.40      11.81  21.799  < 2e-16 ***
## children1              466.08     417.54   1.116 0.264507    
## children2             1617.77     462.39   3.499 0.000483 ***
## children3              920.60     543.02   1.695 0.090247 .  
## children4             3373.35    1229.64   2.743 0.006163 ** 
## children5             1291.95    1440.57   0.897 0.369971    
## smokeryes            23819.51     408.08  58.369  < 2e-16 ***
## bmi_catNormal Weight  1069.70    1372.15   0.780 0.435777    
## bmi_catOverweight     1788.95    1350.31   1.325 0.185452    
## bmi_catObese 1        5180.10    1350.23   3.836 0.000131 ***
## bmi_catObese 2        6347.56    1376.34   4.612 4.38e-06 ***
## bmi_catObese 3        6209.18    1457.60   4.260 2.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6001 on 1325 degrees of freedom
## Multiple R-squared:  0.7567, Adjusted R-squared:  0.7545 
## F-statistic: 343.4 on 12 and 1325 DF,  p-value: < 2.2e-16

using stepwise regression seems to do nothing to improve the model.

Now, we will try to add new variable that combine BMI category and whether the patient is a smoker or not. because as we can see from the EDA above that BMI and smoker has high correlation with the charges. and we can also assume that smoking and high BMI could cause high cost diseases.

insurances %<>%
  mutate(
    health = case_when(
      smoker == "yes" & str_detect(bmi_cat, "Obese 1", negate = F) ~ "Obese 1 and smoker",
      smoker == "yes" & str_detect(bmi_cat, "Obese 2", negate = F) ~ "Obese 2 and smoker",
      smoker == "yes" & str_detect(bmi_cat, "Obese 3", negate = F) ~ "Obese 3 and smoker",
      smoker == "yes" & str_detect(bmi_cat, "Over", negate = F) ~ "Overweight and smoker",
      smoker == "yes" & str_detect(bmi_cat, "Normal", negate = F) ~ "Normal wight and smoker",
      smoker == "yes" & str_detect(bmi_cat, "Under", negate = F) ~ "Under wight and smoker",
      smoker == "no" ~ "Non smoker"
    )
  )
insurances$health %<>% as.factor()
model_lm_ext <- lm(formula = charges ~ age + health + children + region + sex, 
    data = insurances)
summary(model_lm_ext)
## 
## Call:
## lm(formula = charges ~ age + health + children + region + sex, 
##     data = insurances)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4077.1 -1812.7 -1224.6  -541.5 25032.9 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -1770.33     457.67  -3.868 0.000115 ***
## age                             265.49       8.63  30.763  < 2e-16 ***
## healthNormal wight and smoker 11683.07     641.88  18.201  < 2e-16 ***
## healthObese 1 and smoker      31830.16     536.27  59.355  < 2e-16 ***
## healthObese 2 and smoker      34390.72     644.26  53.380  < 2e-16 ***
## healthObese 3 and smoker      36923.53     985.45  37.469  < 2e-16 ***
## healthOverweight and smoker   14433.95     529.28  27.271  < 2e-16 ***
## healthUnder wight and smoker  12089.55    2006.00   6.027 2.17e-09 ***
## children1                       441.38     307.27   1.436 0.151106    
## children2                      1310.89     342.31   3.830 0.000134 ***
## children3                      1162.53     399.26   2.912 0.003655 ** 
## children4                      3678.74     902.75   4.075 4.87e-05 ***
## children5                      1493.52    1066.75   1.400 0.161731    
## regionnorthwest                -349.18     347.95  -1.004 0.315788    
## regionsoutheast                -783.84     341.38  -2.296 0.021828 *  
## regionsouthwest               -1128.38     348.29  -3.240 0.001226 ** 
## sexmale                        -505.60     242.97  -2.081 0.037636 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4413 on 1321 degrees of freedom
## Multiple R-squared:  0.8688, Adjusted R-squared:  0.8672 
## F-statistic: 546.8 on 16 and 1321 DF,  p-value: < 2.2e-16

As we can see, now the model has improve quite well when we add new feature which is emphasize the health condition of patient based on the smoking and BMI variable.

On this first model: * we obtained an adjusted R-squared of 0.86, which means that 86% of the variation in charges could be explained by the two independent variables age and health. * it is also observed that all the independent variables are statistically significant predictors of medical costs (p.value less than 0.05 <- level of significance). * The average charge for an 18-year-old non-smoking insured is 2693.654. * One year increase in age increases the charges by 268,437 for the same category of health. * Compared to non-smokers, controlling for age: + Non-obese smokers have on average 13343.999 more charges. + Obese smokers have on average 33334.018 more loads.

Verification of the regression assumptions for the initial model

Verification of the Normality Residual.

  1. Visualize the residual
hist(model_lm_ext$residuals)

plot(density(model_lm_ext$residuals))

  1. Statistic test using Shapiro-Wilk test:

Shapiro-Wilk hypothesis test: H0: error/residual is normally distributed H1: error/residual is not normally distributed

decision: * rejects H0 if p-value < alpha

shapiro.test(model_lm_ext$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_lm_ext$residuals
## W = 0.48685, p-value < 2.2e-16

Result: p-value is much lower than alpha (0.05), so we can assume that the residual is not normally distributed.

Homoscedasticity

  1. Visualize using scatter plot
plot(model_lm_ext$fitted.values, model_lm_ext$residuals)
abline(h = 0, col = "red")

  1. Statistical test using Breusch-Pagan test:

H0: Error variances is spreading randomly (Homoscedasticity) H1: Error variances is not spreading randomly or make a pattern (Heteroscedasticity)

decision: * rejects H0 if p-value < alpha

library(lmtest)
bptest(model_lm_ext)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_lm_ext
## BP = 18.583, df = 16, p-value = 0.2909

Result: p-value is much bigger than alpha (0.2909 > 0.05), so we failed to reject null hypothesis and we can assume that the error is spreading randomly and constantly (Homoscedasticity).

Conclusion

To obtain even more precision in its predictions, this insurance company should collect more data about its customers in order to explain the behavior of some individuals that we have noticed - in the visualization phase - who do not follow the same trend as the others.