Many of the factors that affect how much you pay for health insurance are not within your control. Nonetheless, it's good to understand what they are. Here are the factors in this dataset that affect health insurance premium costs:
age: age of the primary beneficiary
sex: gender of the insurance contractor (female or male)
bmi: body mass index, an objective measure of body weight relative to height (kg/m^2); it indicates whether weight is relatively high or low for a given height, with an ideal range of 18.5 to 24.9
children: number of children covered by health insurance / number of dependents
smoker: whether the beneficiary smokes
region: the beneficiary's residential area in the US (northeast, southeast, southwest, or northwest)
# import library
library(dplyr) # data wrangling
library(GGally) # correlation variables
library(MLmetrics) # model evaluation
# assumption test
library(car)
library(lmtest)
# EDA
library(inspectdf)
library(performance)
library(scales)
# plot
library(plotly)
#library(tidymodels)
library(data.table)
#library(stats) # provides the step() function

# load data
insurance <- fread("input_data/insurance.csv")
rmarkdown::paged_table(insurance)

# check data type
str(insurance)

## Classes 'data.table' and 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
## - attr(*, ".internal.selfref")=<externalptr>
We change the data type to factor for the columns sex, smoker, region, and children.
# Change data type
insurance <- insurance %>%
mutate_at(vars(sex, smoker, region, children), as.factor)
# re-check adjusted data type
glimpse(insurance)

## Rows: 1,338
## Columns: 7
## $ age <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
## $ sex <fct> female, male, male, male, male, female, female, female, male,~
## $ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <fct> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker <fct> yes, no, no, no, no, no, no, no, no, no, no, yes, no, no, yes~
## $ region <fct> southwest, southeast, southeast, northwest, northwest, southe~
## $ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
# check missing value
anyNA(insurance)

## [1] FALSE

insurance %>%
  is.na() %>%
  colSums()

##      age      sex      bmi children   smoker   region  charges
##        0        0        0        0        0        0        0
There are no missing values in the data.
We set charges as the target variable, with age, sex, bmi, children, smoker, and region as predictors.
# check for target variable
boxplot(insurance$charges)

# check for predictor variables (numeric)
boxplot(insurance$age, insurance$bmi)

There are outliers in the data, in the charges and bmi variables.
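As a rough cross-check (a minimal sketch, assuming the standard 1.5 * IQR whisker rule that boxplot() uses by default), we can count how many observations fall outside the whiskers:

# count values outside the 1.5 * IQR whiskers (the default boxplot rule)
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
count_outliers(insurance$charges)
count_outliers(insurance$bmi)
count_outliers(insurance$age)

charges and bmi should return non-zero counts, while age should return none, matching the boxplots above.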
summary(insurance)

##       age            sex           bmi          children  smoker
##  Min.   :18.00   female:662   Min.   :15.96   0:574     no :1064
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1:324     yes: 274
##  Median :39.00                Median :30.40   2:240
##  Mean   :39.21                Mean   :30.66   3:157
##  3rd Qu.:51.00                3rd Qu.:34.69   4: 25
##  Max.   :64.00                Max.   :53.13   5: 18
##        region        charges
##  northeast:324   Min.   : 1122
##  northwest:325   1st Qu.: 4740
##  southeast:364   Median : 9382
##  southwest:325   Mean   :13270
##                  3rd Qu.:16640
##                  Max.   :63770
# check correlation between variables `bmi` and `charges`
cor(x = insurance$bmi, y = insurance$charges)

## [1] 0.198341

# check correlation between variables `age` and `charges`
cor(x = insurance$age, y = insurance$charges)

## [1] 0.2990082
insurance_ols <- lm(formula = charges~bmi, data = insurance)
summary(insurance_ols)

##
## Call:
## lm(formula = charges ~ bmi, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20956 -8118 -3757 4722 49442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1192.94 1664.80 0.717 0.474
## bmi 393.87 53.25 7.397 0.000000000000246 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11870 on 1336 degrees of freedom
## Multiple R-squared: 0.03934, Adjusted R-squared: 0.03862
## F-statistic: 54.71 on 1 and 1336 DF, p-value: 0.0000000000002459
bmi = insurance$bmi
charges = insurance$charges
plot(x = bmi, y = charges, main = "Correlation between predictor-bmi & target-charges")
abline(insurance_ols, col = "red")

The correlation between bmi (predictor) and charges (target) is weak, since the coefficient (0.198) is close to 0.
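For a broader view of the pairwise relationships we can use the GGally package loaded above (a sketch; ggcorr() only handles numeric columns, so we select them first):

# correlation matrix plot of the numeric columns
insurance %>%
  select(age, bmi, charges) %>%
  ggcorr(label = TRUE, label_round = 2)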
We check the distribution of each variable using the hist() function, as shown below.
hist(insurance$age,
     breaks = 10, main = "Distribution data for Age")

Based on the histogram above, age is spread fairly evenly, with most bins having frequencies of around 100-150.
hist(insurance$bmi,
     breaks = 10, main = "Distribution data for bmi")

Based on the histogram above, bmi is concentrated between 25 and 35, where bin frequencies reach around 380-390.
plot(insurance$sex, main = "Distribution data for Sex (Gender)")

Based on the plot above, the sex (gender) variable contains 662 female and 676 male observations.
plot(insurance$children, main = "Distribution data for Children")

Based on the plot above, the counts for the children variable are:
- 0 : 574
- 1 : 324
- 2 : 240
- 3 : 157
- 4 : 25
- 5 : 18
plot(insurance$smoker, main = "Distribution data for Smoker")

Based on the plot above, the smoker variable contains 1064 non-smokers and 274 smokers.
plot(insurance$region, main = "Distribution data for Region")

Based on the plot above, the counts for the region variable are:
- northeast : 324
- northwest : 325
- southeast : 364
- southwest : 325
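The same categorical frequencies can also be obtained in one call with the inspectdf package loaded earlier (a sketch; inspect_cat() tabulates the levels of every factor column and show_plot() draws them):

# level frequencies of all categorical columns in one step
insurance %>%
  inspect_cat() %>%
  show_plot()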
# model with all predictor variable
insurance_all <- lm(formula = charges~., data = insurance)
summary(insurance_all)

##
## Call:
## lm(formula = charges ~ ., data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11689.4 -2902.6 -943.7 1492.2 30042.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11927.17 993.66 -12.003 < 0.0000000000000002 ***
## age 257.19 11.91 21.587 < 0.0000000000000002 ***
## sexmale -128.16 332.83 -0.385 0.700254
## bmi 336.91 28.61 11.775 < 0.0000000000000002 ***
## children1 390.98 421.35 0.928 0.353619
## children2 1635.78 466.67 3.505 0.000471 ***
## children3 964.34 548.10 1.759 0.078735 .
## children4 2947.37 1239.16 2.379 0.017524 *
## children5 1116.04 1456.02 0.767 0.443514
## smokeryes 23836.41 414.14 57.557 < 0.0000000000000002 ***
## regionnorthwest -380.04 476.56 -0.797 0.425318
## regionsoutheast -1033.14 479.14 -2.156 0.031245 *
## regionsouthwest -952.89 478.15 -1.993 0.046483 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6059 on 1325 degrees of freedom
## Multiple R-squared: 0.7519, Adjusted R-squared: 0.7497
## F-statistic: 334.7 on 12 and 1325 DF, p-value: < 0.00000000000000022
From the model above, we get an Adjusted R-squared of 0.7497 and a model p-value < 0.05, so:
- predictor variables with a significant positive influence on the target variable: age, bmi, children2, children4, smokeryes
- predictor variables with a significant negative influence on the target variable: regionsoutheast, regionsouthwest
- predictor variables with no significant influence on the target variable: sexmale, children1, children3, children5, regionnorthwest
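As a programmatic cross-check of these bullets (a minimal sketch using base R and the same 0.05 cutoff), the significant terms can be filtered straight out of the coefficient table:

# coefficient table of the full model
coefs <- summary(insurance_all)$coefficients

# keep the terms with p-value < 0.05, dropping the intercept
coefs[coefs[, "Pr(>|t|)"] < 0.05 & rownames(coefs) != "(Intercept)",
      c("Estimate", "Pr(>|t|)")]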
summary(insurance_ols)

##
## Call:
## lm(formula = charges ~ bmi, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20956 -8118 -3757 4722 49442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1192.94 1664.80 0.717 0.474
## bmi 393.87 53.25 7.397 0.000000000000246 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11870 on 1336 degrees of freedom
## Multiple R-squared: 0.03934, Adjusted R-squared: 0.03862
## F-statistic: 54.71 on 1 and 1336 DF, p-value: 0.0000000000002459
insurance_ols2 <- lm(formula = charges ~ bmi+smoker, data = insurance)
summary(insurance_ols2)

##
## Call:
## lm(formula = charges ~ bmi + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15992.7 -4600.2 -802.4 3636.2 30677.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3459.10 998.28 -3.465 0.000547 ***
## bmi 388.02 31.79 12.207 < 0.0000000000000002 ***
## smokeryes 23593.98 480.18 49.136 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7088 on 1335 degrees of freedom
## Multiple R-squared: 0.6579, Adjusted R-squared: 0.6574
## F-statistic: 1284 on 2 and 1335 DF, p-value: < 0.00000000000000022
head(insurance_ols2$fitted.values)

##         1         2         3         4         5         6
## 30960.511  9644.179  9345.408  5350.791  7746.785  6528.417

RMSE(y_pred = insurance_ols2$fitted.values, y_true = insurance$charges)

## [1] 7079.981
compare_performance(insurance_ols,
                    insurance_ols2,
                    insurance_all)

From the model comparison above, the best model for the insurance data is insurance_all, which has the lowest error. Comparing insurance_ols and insurance_ols2, we also see that adding predictors lowers both the AIC and the RMSE; in particular, the smoker predictor has a significant effect in the insurance_ols2 model.
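To make the error comparison concrete (a sketch reusing MLmetrics::RMSE from above; note this is in-sample RMSE, not a hold-out estimate), we can compute the RMSE of each candidate model side by side:

# in-sample RMSE for each candidate model
sapply(list(insurance_ols  = insurance_ols,
            insurance_ols2 = insurance_ols2,
            insurance_all  = insurance_all),
       function(m) RMSE(y_pred = m$fitted.values, y_true = insurance$charges))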
# model backward
model_back <- stats::step(insurance_all, direction = "backward")

## Start: AIC=23318.92
## charges ~ age + sex + bmi + children + smoker + region
##
## Df Sum of Sq RSS AIC
## - sex 1 5442885 48644525941 23317
## <none> 48639083056 23319
## - region 3 226406038 48865489094 23319
## - children 5 637996402 49277079458 23326
## - bmi 1 5089678622 53728761678 23450
## - age 1 17105754278 65744837334 23720
## - smoker 1 121607401409 170246484465 24993
##
## Step: AIC=23317.07
## charges ~ age + bmi + children + smoker + region
##
## Df Sum of Sq RSS AIC
## <none> 48644525941 23317
## - region 3 226128753 48870654694 23317
## - children 5 636681365 49281207306 23325
## - bmi 1 5085311935 53729837877 23448
## - age 1 17130650713 65775176654 23719
## - smoker 1 122195213508 170839739450 24996
summary(model_back)

##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11620.3 -2883.5 -945.6 1513.0 29986.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11977.26 984.79 -12.162 < 0.0000000000000002 ***
## age 257.30 11.91 21.609 < 0.0000000000000002 ***
## bmi 336.39 28.57 11.774 < 0.0000000000000002 ***
## children1 388.71 421.17 0.923 0.356211
## children2 1635.23 466.52 3.505 0.000471 ***
## children3 962.98 547.91 1.758 0.079055 .
## children4 2938.65 1238.56 2.373 0.017804 *
## children5 1106.45 1455.33 0.760 0.447227
## smokeryes 23824.24 412.80 57.714 < 0.0000000000000002 ***
## regionnorthwest -379.44 476.40 -0.796 0.425908
## regionsoutheast -1032.43 478.98 -2.155 0.031304 *
## regionsouthwest -952.16 478.00 -1.992 0.046577 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6057 on 1326 degrees of freedom
## Multiple R-squared: 0.7519, Adjusted R-squared: 0.7498
## F-statistic: 365.3 on 11 and 1326 DF, p-value: < 0.00000000000000022
From the backward model above, we get:
- predictor variables with a significant positive influence on the target variable: age, bmi, children2, children4, smokeryes
- predictor variables with a significant negative influence on the target variable: regionsoutheast, regionsouthwest
- predictor variables with no significant influence on the target variable: children1, children3, children5, regionnorthwest
# model forward
model_forward <- stats::step(insurance_all, direction = "forward")

## Start: AIC=23318.92
## charges ~ age + sex + bmi + children + smoker + region
summary(model_forward)

##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11689.4 -2902.6 -943.7 1492.2 30042.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11927.17 993.66 -12.003 < 0.0000000000000002 ***
## age 257.19 11.91 21.587 < 0.0000000000000002 ***
## sexmale -128.16 332.83 -0.385 0.700254
## bmi 336.91 28.61 11.775 < 0.0000000000000002 ***
## children1 390.98 421.35 0.928 0.353619
## children2 1635.78 466.67 3.505 0.000471 ***
## children3 964.34 548.10 1.759 0.078735 .
## children4 2947.37 1239.16 2.379 0.017524 *
## children5 1116.04 1456.02 0.767 0.443514
## smokeryes 23836.41 414.14 57.557 < 0.0000000000000002 ***
## regionnorthwest -380.04 476.56 -0.797 0.425318
## regionsoutheast -1033.14 479.14 -2.156 0.031245 *
## regionsouthwest -952.89 478.15 -1.993 0.046483 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6059 on 1325 degrees of freedom
## Multiple R-squared: 0.7519, Adjusted R-squared: 0.7497
## F-statistic: 334.7 on 12 and 1325 DF, p-value: < 0.00000000000000022
From the forward model above, we get:
- predictor variables with a significant positive influence on the target variable: age, bmi, children2, children4, smokeryes
- predictor variables with a significant negative influence on the target variable: regionsoutheast, regionsouthwest
- predictor variables with no significant influence on the target variable: sexmale, children1, children3, children5, regionnorthwest
# model both
model_both <- stats::step(insurance_all, direction = "both")

## Start: AIC=23318.92
## charges ~ age + sex + bmi + children + smoker + region
##
## Df Sum of Sq RSS AIC
## - sex 1 5442885 48644525941 23317
## <none> 48639083056 23319
## - region 3 226406038 48865489094 23319
## - children 5 637996402 49277079458 23326
## - bmi 1 5089678622 53728761678 23450
## - age 1 17105754278 65744837334 23720
## - smoker 1 121607401409 170246484465 24993
##
## Step: AIC=23317.07
## charges ~ age + bmi + children + smoker + region
##
## Df Sum of Sq RSS AIC
## <none> 48644525941 23317
## - region 3 226128753 48870654694 23317
## + sex 1 5442885 48639083056 23319
## - children 5 636681365 49281207306 23325
## - bmi 1 5085311935 53729837877 23448
## - age 1 17130650713 65775176654 23719
## - smoker 1 122195213508 170839739450 24996
summary(model_both)

##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11620.3 -2883.5 -945.6 1513.0 29986.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11977.26 984.79 -12.162 < 0.0000000000000002 ***
## age 257.30 11.91 21.609 < 0.0000000000000002 ***
## bmi 336.39 28.57 11.774 < 0.0000000000000002 ***
## children1 388.71 421.17 0.923 0.356211
## children2 1635.23 466.52 3.505 0.000471 ***
## children3 962.98 547.91 1.758 0.079055 .
## children4 2938.65 1238.56 2.373 0.017804 *
## children5 1106.45 1455.33 0.760 0.447227
## smokeryes 23824.24 412.80 57.714 < 0.0000000000000002 ***
## regionnorthwest -379.44 476.40 -0.796 0.425908
## regionsoutheast -1032.43 478.98 -2.155 0.031304 *
## regionsouthwest -952.16 478.00 -1.992 0.046577 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6057 on 1326 degrees of freedom
## Multiple R-squared: 0.7519, Adjusted R-squared: 0.7498
## F-statistic: 365.3 on 11 and 1326 DF, p-value: < 0.00000000000000022
From the both-directions model above, we get:
- predictor variables with a significant positive influence on the target variable: age, bmi, children2, children4, smokeryes
- predictor variables with a significant negative influence on the target variable: regionsoutheast, regionsouthwest
- predictor variables with no significant influence on the target variable: children1, children3, children5, regionnorthwest

compare_performance(model_back,
                    model_forward,
                    model_both)

From the model comparison above, the best model, with the lowest AIC (27116.153), is model_back (equivalently model_both, since both directions select the same formula).
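The AIC ranking can also be sanity-checked with base R (a sketch; stats::AIC() accepts several fitted models at once, while step() reports AIC with a different additive constant, so only relative differences are comparable):

# AIC of the full and stepwise models, side by side
AIC(insurance_all, model_back, model_forward, model_both)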
Linearity hypothesis test:
H0: the correlation is not significant
H1: the correlation is significant
cor.test(x = insurance$bmi, y = insurance$charges)

##
## Pearson's product-moment correlation
##
## data: insurance$bmi and insurance$charges
## t = 7.3966, df = 1336, p-value = 0.0000000000002459
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1463052 0.2492822
## sample estimates:
## cor
## 0.198341
From the result above, the p-value (0.0000000000002459) is < 0.05, so we reject H0: the correlation between charges and bmi is significant.
Shapiro-Wilk hypothesis test:
H0: the errors/residuals are normally distributed
H1: the errors/residuals are not normally distributed
# ideal model : pvalue > alpha (0.05)
shapiro.test(insurance_all$residuals)

##
## Shapiro-Wilk normality test
##
## data: insurance_all$residuals
## W = 0.90383, p-value < 0.00000000000000022
shapiro.test(model_back$residuals)

##
## Shapiro-Wilk normality test
##
## data: model_back$residuals
## W = 0.90392, p-value < 0.00000000000000022
From the results above, the p-value (< 0.00000000000000022) is < 0.05, so we reject H0: the model residuals are not normally distributed.
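A quick visual check of the same conclusion (a sketch using base plotting) is to look at the residual histogram and Q-Q plot, which should show the skewed, non-normal shape the test detects:

# visual check of residual normality for the selected model
hist(model_back$residuals, breaks = 30,
     main = "Distribution of residuals (model_back)")
qqnorm(model_back$residuals)
qqline(model_back$residuals, col = "red")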
Statistical test with Breusch-Pagan from the lmtest package.
Hypotheses:
H0: the variance of the errors is constant (homoscedasticity)
H1: the variance of the errors is not constant / forms a pattern (heteroscedasticity)
# ideal model : p-value > alpha (0.05) -> Homoscedasticity
bptest(insurance_all)

##
## studentized Breusch-Pagan test
##
## data: insurance_all
## BP = 132.47, df = 12, p-value < 0.00000000000000022
bptest(model_back)

##
## studentized Breusch-Pagan test
##
## data: model_back
## BP = 131.59, df = 11, p-value < 0.00000000000000022
From the results above, the p-value (< 0.00000000000000022) is < 0.05, so we reject H0: the residual variance is heterogeneous, i.e. the model is not homoscedastic.
plot(model_back$fitted.values, model_back$residuals)
abline(h = 0, col = "red")

plot(insurance_all$fitted.values, insurance_all$residuals)
abline(h = 0, col = "red")

Based on the scatterplots of fitted values against residuals for both insurance_all and model_back, the spread of the errors/residuals appears random.
# ideal model : VIF(Variance Inflation Factor) < 10
vif(insurance_all)

## GVIF Df GVIF^(1/(2*Df))
## age 1.020607 1 1.010251
## sex 1.009334 1 1.004656
## bmi 1.108836 1 1.053013
## children 1.024871 5 1.002460
## smoker 1.018024 1 1.008972
## region 1.109919 3 1.017533
vif(model_back)

## GVIF Df GVIF^(1/(2*Df))
## age 1.020006 1 1.009954
## bmi 1.106363 1 1.051838
## children 1.024128 5 1.002387
## smoker 1.012092 1 1.006028
## region 1.109896 3 1.017530
Based on the results above, all VIF values are < 10, so there is no multicollinearity in either model.
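For one consolidated view of these diagnostics, the performance package loaded earlier offers check_model() (a sketch; it draws plots for linearity, normality of residuals, homoscedasticity, and collinearity in a single call, and may ask you to install its companion plotting package see):

# one-shot visual diagnostics for the selected model
check_model(model_back)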
Variables that are useful in explaining the variance in insurance charges are age, bmi, and smoker. Our final model passes the linearity and multicollinearity checks, although its residuals fail the normality and homoscedasticity tests. The fit is reasonably good: the adjusted R-squared shows that the model explains 74.98% of the variance in insurance charges. The accuracy of the model in predicting insurance charges, measured by RMSE, is 6029.269. We have learned how to build a linear regression model and what needs attention along the way.
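For reference, the RMSE quoted above can be computed the same way as earlier (a sketch; this is in-sample RMSE on the full data, since no train/test split was made):

# in-sample RMSE of the final (backward-selected) model
RMSE(y_pred = model_back$fitted.values, y_true = insurance$charges)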
The insurance data is from https://www.kaggle.com/mariapushkareva/medical-insurance-cost-with-linear-regression/notebook