1. Bussiness Problem

Which Factors Influence the Price of Health Insurance?

Many factors that affect how much you pay for health insurance are not within your control. Nonetheless, it’s good to have an understanding of what they are. Here are some factors that affect how much health insurance premiums cost

age: age of primary beneficiary

sex: insurance contractor gender, female, male

bmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9

children: Number of children covered by health insurance / Number of dependents

smoker: Smoking

region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest

2. Data Preparation

2.1. Import Library and Load data

# import library
library(dplyr) # data wrangling
library(GGally) # correlation variables
library(MLmetrics) # model evaluation

# assumption test
library(car)
library(lmtest)


# EDA
library(inspectdf)
library(performance)
library(scales)

# plot
library(plotly)
#library(tidymodels)
library(data.table)
#library(stats) # do step() function

# load data
insurance <- fread("input_data/insurance.csv")

rmarkdown::paged_table(insurance)

# check data

2.2. Check Data Structure

# Check data type
str(insurance)

## Classes 'data.table' and 'data.frame':   1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...
##  - attr(*, ".internal.selfref")=<externalptr>

We change data type to factor for column sex, smoker, region, and children

# Change data type
insurance <- insurance %>% 
  mutate_at(vars(sex, smoker, region, children), as.factor)

# re-check adjusted data type
glimpse(insurance)

## Rows: 1,338
## Columns: 7
## $ age      <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
## $ sex      <fct> female, male, male, male, male, female, female, female, male,~
## $ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <fct> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker   <fct> yes, no, no, no, no, no, no, no, no, no, no, yes, no, no, yes~
## $ region   <fct> southwest, southeast, southeast, northwest, northwest, southe~
## $ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~

2.3. Check Missing Value

# check missing value
anyNA(insurance)

## [1] FALSE

insurance %>% 
  is.na() %>% 
  colSums()

##      age      sex      bmi children   smoker   region  charges 
##        0        0        0        0        0        0        0

There is no missing value in the data.

2.4. Check Outlier Data

We set for variable target (charges) and predictor (age, sex, bmi, children, smoker, and region).

# check for target variable
boxplot(insurance$charges)

# check for predictor variable (numeric)
boxplot(insurance$age, insurance$bmi)

There are outlier in the data variables. They are charges and bmi.

2.5. Check Summary data

summary(insurance)

##       age            sex           bmi        children smoker    
##  Min.   :18.00   female:662   Min.   :15.96   0:574    no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1:324    yes: 274  
##  Median :39.00                Median :30.40   2:240              
##  Mean   :39.21                Mean   :30.66   3:157              
##  3rd Qu.:51.00                3rd Qu.:34.69   4: 25              
##  Max.   :64.00                Max.   :53.13   5: 18              
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

3. Exploratory Data Analysis

3.1. Check corelation between variable predictor and target

# check corelation between variables `bmi` and `charges`
cor(x = insurance$bmi, y = insurance$charges)

## [1] 0.198341

# check corelation between variables `age` and `charges`
cor(x = insurance$age, y = insurance$charges)

## [1] 0.2990082

insurance_ols <- lm(formula = charges~bmi, data = insurance)
summary(insurance_ols)

## 
## Call:
## lm(formula = charges ~ bmi, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20956  -8118  -3757   4722  49442 
## 
## Coefficients:
##             Estimate Std. Error t value          Pr(>|t|)    
## (Intercept)  1192.94    1664.80   0.717             0.474    
## bmi           393.87      53.25   7.397 0.000000000000246 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11870 on 1336 degrees of freedom
## Multiple R-squared:  0.03934,    Adjusted R-squared:  0.03862 
## F-statistic: 54.71 on 1 and 1336 DF,  p-value: 0.0000000000002459

bmi = insurance$bmi
charges = insurance$charges
plot(x = bmi, y = charges, main = "Correlation between predictor-bmi & target-charges")
abline(insurance_ols, col = "red")

Correlation between bmi (predictor) and charges (target) is tends to be weak as it gets closer to 0

3.2. Check distribution data and pattern data

We do check distribution data for each variable by using hist() function like on below.

hist(insurance$age,
     breaks = 10, main = "Distribution data for Age")

Based on histogram above, the distribution of ‘age’ data tends to be many among the frequencies of 100 - 150.

hist(insurance$bmi,
     breaks = 10, main = "Distribution data for bmi")

Based on histogram above, the distribution of ‘bmi’ data tends to be many from 25-35 with frequency around 380 - 390.

plot(insurance$sex, main = "Distribution data for Sex (Gender)")

Based on the plot above, the distribution data for variable sex (Gender) are female (662) and male (676).

plot(insurance$children, main = "Distribution data for Children")

Based on the plot above, the distribution data for variable children are :
- 0 : 574
- 1 : 324
- 2 : 240
- 3 : 157
- 4 : 25
- 5 : 18

plot(insurance$smoker, main = "Distribution data for Smoker")

Based on the plot above, the distribution data for variable smoker are not smoker (1064) and smoker (274).

plot(insurance$region, main = "Distribution data for Region")

Based on the plot above, the distribution data for variable region are : - northeast : 324 - northwest : 325 - southeast : 364 - southwest : 325

4. Modelling

# model with all predictor variable
insurance_all <- lm(formula = charges~., data = insurance)
summary(insurance_all)

## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11689.4  -2902.6   -943.7   1492.2  30042.7 
## 
## Coefficients:
##                  Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     -11927.17     993.66 -12.003 < 0.0000000000000002 ***
## age                257.19      11.91  21.587 < 0.0000000000000002 ***
## sexmale           -128.16     332.83  -0.385             0.700254    
## bmi                336.91      28.61  11.775 < 0.0000000000000002 ***
## children1          390.98     421.35   0.928             0.353619    
## children2         1635.78     466.67   3.505             0.000471 ***
## children3          964.34     548.10   1.759             0.078735 .  
## children4         2947.37    1239.16   2.379             0.017524 *  
## children5         1116.04    1456.02   0.767             0.443514    
## smokeryes        23836.41     414.14  57.557 < 0.0000000000000002 ***
## regionnorthwest   -380.04     476.56  -0.797             0.425318    
## regionsoutheast  -1033.14     479.14  -2.156             0.031245 *  
## regionsouthwest   -952.89     478.15  -1.993             0.046483 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6059 on 1325 degrees of freedom
## Multiple R-squared:  0.7519, Adjusted R-squared:  0.7497 
## F-statistic: 334.7 on 12 and 1325 DF,  p-value: < 0.00000000000000022

From the model on above, we get Adjusted R-squared value (0.7497) and p-value < 0.05, so : - the predictor variable have significant positive influence to target variable (such as age, bmi, children2, children4, smokeryes)

the predictor variable have significant negative influence to target variable (such as regionsoutheast, regionsouthwest)
the predictor variable doesn’t have significant positive/negative influence to target variable (such as sexmale, children1, children3, children5, regionnorthwest)

5. Evaluation

summary(insurance_ols)

## 
## Call:
## lm(formula = charges ~ bmi, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20956  -8118  -3757   4722  49442 
## 
## Coefficients:
##             Estimate Std. Error t value          Pr(>|t|)    
## (Intercept)  1192.94    1664.80   0.717             0.474    
## bmi           393.87      53.25   7.397 0.000000000000246 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11870 on 1336 degrees of freedom
## Multiple R-squared:  0.03934,    Adjusted R-squared:  0.03862 
## F-statistic: 54.71 on 1 and 1336 DF,  p-value: 0.0000000000002459

insurance_ols2 <- lm(formula = charges ~ bmi+smoker, data = insurance)
summary(insurance_ols2)

## 
## Call:
## lm(formula = charges ~ bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15992.7  -4600.2   -802.4   3636.2  30677.8 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -3459.10     998.28  -3.465             0.000547 ***
## bmi           388.02      31.79  12.207 < 0.0000000000000002 ***
## smokeryes   23593.98     480.18  49.136 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7088 on 1335 degrees of freedom
## Multiple R-squared:  0.6579, Adjusted R-squared:  0.6574 
## F-statistic:  1284 on 2 and 1335 DF,  p-value: < 0.00000000000000022

head(insurance_ols2$fitted.values)

##         1         2         3         4         5         6 
## 30960.511  9644.179  9345.408  5350.791  7746.785  6528.417

RMSE(y_pred = insurance_ols2$fitted.values, y_true = insurance$charges)

## [1] 7079.981

compare_performance(insurance_ols,
insurance_ols2,
insurance_all)

From the comparison model on above, we get the best model for data insurance is model insurance_all with lowest error. For insurance_ols and insurance_ols2, we can see the more predictors, the more least AIC and RMSE values. Predictor variable smoker has a significant effect on the model insurance_ols2.

6. Model Improvement

6.1. Model Step-Wise Regression

# model backward
model_back <- stats::step(insurance_all, direction = "backward")

## Start:  AIC=23318.92
## charges ~ age + sex + bmi + children + smoker + region
## 
##            Df    Sum of Sq          RSS   AIC
## - sex       1      5442885  48644525941 23317
## <none>                      48639083056 23319
## - region    3    226406038  48865489094 23319
## - children  5    637996402  49277079458 23326
## - bmi       1   5089678622  53728761678 23450
## - age       1  17105754278  65744837334 23720
## - smoker    1 121607401409 170246484465 24993
## 
## Step:  AIC=23317.07
## charges ~ age + bmi + children + smoker + region
## 
##            Df    Sum of Sq          RSS   AIC
## <none>                      48644525941 23317
## - region    3    226128753  48870654694 23317
## - children  5    636681365  49281207306 23325
## - bmi       1   5085311935  53729837877 23448
## - age       1  17130650713  65775176654 23719
## - smoker    1 122195213508 170839739450 24996

summary(model_back)

## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11620.3  -2883.5   -945.6   1513.0  29986.9 
## 
## Coefficients:
##                  Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     -11977.26     984.79 -12.162 < 0.0000000000000002 ***
## age                257.30      11.91  21.609 < 0.0000000000000002 ***
## bmi                336.39      28.57  11.774 < 0.0000000000000002 ***
## children1          388.71     421.17   0.923             0.356211    
## children2         1635.23     466.52   3.505             0.000471 ***
## children3          962.98     547.91   1.758             0.079055 .  
## children4         2938.65    1238.56   2.373             0.017804 *  
## children5         1106.45    1455.33   0.760             0.447227    
## smokeryes        23824.24     412.80  57.714 < 0.0000000000000002 ***
## regionnorthwest   -379.44     476.40  -0.796             0.425908    
## regionsoutheast  -1032.43     478.98  -2.155             0.031304 *  
## regionsouthwest   -952.16     478.00  -1.992             0.046577 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6057 on 1326 degrees of freedom
## Multiple R-squared:  0.7519, Adjusted R-squared:  0.7498 
## F-statistic: 365.3 on 11 and 1326 DF,  p-value: < 0.00000000000000022

From the model backward above, we get :

p-value (0.00000000000000022) < 0.05, so :
- the predictor variable have significant positive influence to target variable (such as age, bmi, children2, children4, smokeryes)
- the predictor variable have significant negative influence to target variable (such as regionsoutheast, regionsouthwest)
- the predictor variable doesn’t have significant positive/negative influence to target variable (such as children1, children3, children5, regionnorthwest)
Adjusted R-squared : 0.7498
AIC : AIC=23317.07

# model forward
model_forward <- stats::step(insurance_all, direction = "forward")

## Start:  AIC=23318.92
## charges ~ age + sex + bmi + children + smoker + region

summary(model_forward)

## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11689.4  -2902.6   -943.7   1492.2  30042.7 
## 
## Coefficients:
##                  Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     -11927.17     993.66 -12.003 < 0.0000000000000002 ***
## age                257.19      11.91  21.587 < 0.0000000000000002 ***
## sexmale           -128.16     332.83  -0.385             0.700254    
## bmi                336.91      28.61  11.775 < 0.0000000000000002 ***
## children1          390.98     421.35   0.928             0.353619    
## children2         1635.78     466.67   3.505             0.000471 ***
## children3          964.34     548.10   1.759             0.078735 .  
## children4         2947.37    1239.16   2.379             0.017524 *  
## children5         1116.04    1456.02   0.767             0.443514    
## smokeryes        23836.41     414.14  57.557 < 0.0000000000000002 ***
## regionnorthwest   -380.04     476.56  -0.797             0.425318    
## regionsoutheast  -1033.14     479.14  -2.156             0.031245 *  
## regionsouthwest   -952.89     478.15  -1.993             0.046483 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6059 on 1325 degrees of freedom
## Multiple R-squared:  0.7519, Adjusted R-squared:  0.7497 
## F-statistic: 334.7 on 12 and 1325 DF,  p-value: < 0.00000000000000022

From the model forward on above, we get :

p-value (0.00000000000000022) < 0.05, so :
- the predictor variable have significant positive influence to target variable (such as age, bmi, children2, children4, smokeryes)
- the predictor variable have significant negative influence to target variable (such as regionsoutheast, regionsouthwest)
- the predictor variable doesn’t have significant positive/negative influence to target variable (such as sexmale, children1, children3, children5, regionnorthwest)
Adjusted R-squared : 0.7497
AIC : 23318.92

# model both
model_both <- stats::step(insurance_all, direction = "both")

## Start:  AIC=23318.92
## charges ~ age + sex + bmi + children + smoker + region
## 
##            Df    Sum of Sq          RSS   AIC
## - sex       1      5442885  48644525941 23317
## <none>                      48639083056 23319
## - region    3    226406038  48865489094 23319
## - children  5    637996402  49277079458 23326
## - bmi       1   5089678622  53728761678 23450
## - age       1  17105754278  65744837334 23720
## - smoker    1 121607401409 170246484465 24993
## 
## Step:  AIC=23317.07
## charges ~ age + bmi + children + smoker + region
## 
##            Df    Sum of Sq          RSS   AIC
## <none>                      48644525941 23317
## - region    3    226128753  48870654694 23317
## + sex       1      5442885  48639083056 23319
## - children  5    636681365  49281207306 23325
## - bmi       1   5085311935  53729837877 23448
## - age       1  17130650713  65775176654 23719
## - smoker    1 122195213508 170839739450 24996

summary(model_both)

## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11620.3  -2883.5   -945.6   1513.0  29986.9 
## 
## Coefficients:
##                  Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     -11977.26     984.79 -12.162 < 0.0000000000000002 ***
## age                257.30      11.91  21.609 < 0.0000000000000002 ***
## bmi                336.39      28.57  11.774 < 0.0000000000000002 ***
## children1          388.71     421.17   0.923             0.356211    
## children2         1635.23     466.52   3.505             0.000471 ***
## children3          962.98     547.91   1.758             0.079055 .  
## children4         2938.65    1238.56   2.373             0.017804 *  
## children5         1106.45    1455.33   0.760             0.447227    
## smokeryes        23824.24     412.80  57.714 < 0.0000000000000002 ***
## regionnorthwest   -379.44     476.40  -0.796             0.425908    
## regionsoutheast  -1032.43     478.98  -2.155             0.031304 *  
## regionsouthwest   -952.16     478.00  -1.992             0.046577 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6057 on 1326 degrees of freedom
## Multiple R-squared:  0.7519, Adjusted R-squared:  0.7498 
## F-statistic: 365.3 on 11 and 1326 DF,  p-value: < 0.00000000000000022

From the model both on above, we get :

p-value (0.00000000000000022) < 0.05, so :
- the predictor variable have significant positive influence to target variable (such as age, bmi, children2, children4, smokeryes)
- the predictor variable have significant negative influence to target variable (such as regionsoutheast, regionsouthwest)
- the predictor variable doesn’t have significant positive/negative influence to target variable (such as children1, children3, children5, regionnorthwest)
Adjusted R-squared : 0.7498
AIC : 23317.07

6.2. Compare Performance for Model Step-Wise Regression

compare_performance(model_back,
model_forward,
model_both)

From the comparison model above, we get the best model with least AIC (27116.153) is model_back or model_both.

7. Check Assumption Linear Regression

7.1. Linearity

Linearity hypothesis test: H0: the correlation doesn’t significant H1: the correlation does significant

cor.test(x = insurance$bmi, y = insurance$charges)

## 
##  Pearson's product-moment correlation
## 
## data:  insurance$bmi and insurance$charges
## t = 7.3966, df = 1336, p-value = 0.0000000000002459
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1463052 0.2492822
## sample estimates:
##      cor 
## 0.198341

From the result on above, we get p-value (0.0000000000002459) < 0.05 (disagree H0), so the correlation between charges and bmi does significant.

7.2. Normality

Shapiro-Wilk hypothesis test:

H0: error/residuals does normal distributed H1: error/residuals doesn’t normal distributed

# ideal model :  pvalue > alpha (0.05)
shapiro.test(insurance_all$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  insurance_all$residuals
## W = 0.90383, p-value < 0.00000000000000022

shapiro.test(model_back$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_back$residuals
## W = 0.90392, p-value < 0.00000000000000022

From the result on above, we get p-value (0.00000000000000022) < 0.05 (disagree H0), so residuals model doesn’t normal distributed.

7.3. Homoscedasticity

Statistic Test with Breusch-Pagan from package lmtest

Hypothesis : H0: Variance of spread error is constant (Homoscedasticity) H1: Variance of spread error is not constant / forms a pattern (Heteroscedasticity)

# ideal model : p-value > alpha (0.05) -> Homoscedasticity 
bptest(insurance_all)

## 
##  studentized Breusch-Pagan test
## 
## data:  insurance_all
## BP = 132.47, df = 12, p-value < 0.00000000000000022

bptest(model_back)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_back
## BP = 131.59, df = 11, p-value < 0.00000000000000022

From the result on above, we get p-value (0.00000000000000022) < 0.05 (disagree H0). Then, the residual is heterogenous or the model does not Homoscedasticity.

plot(model_back$fitted.values, model_back$residuals, ylim = c(-50,50))
abline(h = 0, col = "red")

plot(insurance_all$fitted.values, insurance_all$residuals, ylim = c(-50,50))
abline(h = 0, col = "red")

Based on the scatterplot for both model insurance_all and model_back, the spread of error / residuals is random.

7.4. Multicolinearity

# ideal model : VIF(Variance Inflation Factor) < 10
vif(insurance_all)

##              GVIF Df GVIF^(1/(2*Df))
## age      1.020607  1        1.010251
## sex      1.009334  1        1.004656
## bmi      1.108836  1        1.053013
## children 1.024871  5        1.002460
## smoker   1.018024  1        1.008972
## region   1.109919  3        1.017533

vif(model_back)

##              GVIF Df GVIF^(1/(2*Df))
## age      1.020006  1        1.009954
## bmi      1.106363  1        1.051838
## children 1.024128  5        1.002387
## smoker   1.012092  1        1.006028
## region   1.109896  3        1.017530

Based on the result on above, we get VIF < 10. So, there is no multicolinearity in the model above.

8. Conclusion

Variables that are useful to describe the variances in insurance charges are age, bmi, and smoker. Our final model has satisfied the classical assumptions. The R-squared of the model is good enough, with 74,98% of the variables can explain the variances in the insurance charges. The accuracy of the model in predicting the insurance charges is measured with RMSE of 6029.269. We have already learn how to build a linear regression model and what need to be concerned when building the model.

9. Reference

I get data insurance from https://www.kaggle.com/mariapushkareva/medical-insurance-cost-with-linear-regression/notebook

LBB-RM-Melia Sie

Melia Sie

12/12/2021