Health Insurance illustration. (Source: Appsergu)
In this project, I will be using linear regression to analyze which factors are good predictors of health insurance charges. The dataset used in this analysis is the insurance dataset from GitHub.
Load the required libraries
library(tidyverse) # data manipulation
library(GGally) # correlation checking
library(lmtest) # Breusch-Pagan test for heteroscedasticity
library(car) # variance inflation factor (VIF)
library(see) # visualization utilities
library(MLmetrics) # RMSE and other error metrics
Load the dataset and convert the categorical variables into factors.
insurance <- read.csv("insurance.csv", stringsAsFactors = T)
rmarkdown::paged_table(insurance)
Inspect the data structure.
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
There are 6 predictor variables in this dataset that will be used in the analysis, namely:
* Age: age of the beneficiary
* Sex: sex of the beneficiary
* BMI: body mass index; a value derived from a person's weight (in kg) divided by the square of height (in m)
* Children: number of children
* Smoker: smoking status
* Region: the beneficiary's residential area
All variables already have the correct data type. Now, check if there is any duplicated data.
insurance[duplicated(insurance), ]
## age sex bmi children smoker region charges
## 582 19 male 30.59 0 no northwest 1639.563
Turns out there is one. Eliminate the duplicated data.
insurance <- insurance %>% distinct()
Check for missing data.
anyNA(insurance)
## [1] FALSE
Looks like there is no missing data in this dataset, so we can proceed with further analysis.
Now that the duplicated data is gone, check the summary.
summary(insurance)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1063
## 1st Qu.:27.00 male :675 1st Qu.:26.29 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.22 Mean :30.66 Mean :1.096
## 3rd Qu.:51.00 3rd Qu.:34.70 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:324 1st Qu.: 4746
## southeast:364 Median : 9386
## southwest:325 Mean :13279
## 3rd Qu.:16658
## Max. :63770
The youngest insurance beneficiary is 18 years old, while the oldest is 64 years old. The proportion of female and male beneficiaries is almost equal. In terms of BMI, three quarters of the insurance beneficiaries are overweight (with a BMI above 25). Half of the beneficiaries have one or more children, while the other half have none. The majority of the insurance beneficiaries do not smoke. Based on region, there is a similar number of people in each area.
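As a quick check of these proportions, here is a small sketch using the cleaned insurance data (the exact numbers are not part of the original output):
# share of beneficiaries with BMI above 25 (overweight or obese)
mean(insurance$bmi > 25)
# share of beneficiaries with at least one child
mean(insurance$children >= 1)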
This analysis aims to predict the insurance charges based on several predictors using linear regression, so the target variable is charges. Let's inspect the distribution of the target variable.
hist(insurance$charges)
The distribution of the insurance charges is right-skewed, with only a small number of beneficiaries charged more than $50,000.
Now, let’s see the distribution of insurance charges based on the predictors.
# Sex
ggplot(insurance, aes(sex, charges)) +
geom_boxplot(aes(fill = sex))
From the plot, it looks like there is no significant difference in insurance charges between male and female beneficiaries.
# Region
ggplot(insurance, aes(region, charges)) +
geom_boxplot(aes(fill = region))
Based on region, there is also no significant difference in charges across the categories.
# Smoking status
ggplot(insurance, aes(smoker, charges)) +
geom_boxplot(aes(fill = smoker))
In terms of smoking status, people who smoke tend to be charged more than people who do not smoke.
# BMI
ggplot(insurance, aes(bmi, charges)) +
geom_point() + geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
There is an upward trend of insurance charges based on BMI: the higher the BMI, the higher the beneficiary tends to be charged.
# Based on age
ggplot(insurance, aes(age, charges)) +
geom_point() +geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
There is an upward trend of charges based on age as well.
# Based on number of children
ggplot(insurance, aes(children, charges)) +
geom_point()
Based on the number of children, it seems that people with 5 children tend to be charged less.
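To make this comparison concrete, here is a short sketch that computes the mean charge per number of children (the averages themselves are not shown in the original output):
# average charge and group size for each number of children
insurance %>%
  group_by(children) %>%
  summarise(mean_charges = mean(charges), n = n())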
Now, look at the correlation between numerical variables and the target variable.
ggcorr(insurance, label = T)
## Warning in ggcorr(insurance, label = T): data in column(s) 'sex', 'smoker',
## 'region' are not numeric and were ignored
Among the numerical predictors in the data, age has the highest correlation with the target variable (0.3).
Before creating the model, split the dataset into train and test data.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(110)
index <- sample(nrow(insurance), nrow(insurance)*0.8)
train <- insurance[index, ]
test <- insurance[-index, ]
Create the first model using the predictor variable with the highest correlation with the target variable, age.
model1 <- lm(charges ~ age, data = train)
summary(model1)
##
## Call:
## lm(formula = charges ~ age, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8102 -6706 -5875 5353 47881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3669.22 1033.71 3.55 0.000403 ***
## age 245.38 24.91 9.85 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11560 on 1067 degrees of freedom
## Multiple R-squared: 0.08336, Adjusted R-squared: 0.0825
## F-statistic: 97.03 on 1 and 1067 DF, p-value: < 2.2e-16
From the summary, we can see the intercept of the model, i.e. the predicted insurance charge for a 0-year-old beneficiary, is $3,669.22. The age variable has a significant effect on the target variable, and for every 1-year increase in the beneficiary's age, the insurance charge increases by $245.38. The adjusted R-squared of the model is 0.0825, which means this model can only explain about 8.3% of the total variation in charges.
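As a worked example of how these coefficients are used (the age below is chosen purely for illustration), the predicted charge for a hypothetical 30-year-old beneficiary is:
# manual prediction: intercept + slope * age
3669.22 + 245.38 * 30   # about 11,031
# equivalent prediction using the fitted model
predict(model1, newdata = data.frame(age = 30))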
Next, create a model using all numeric predictors.
model2 <- lm(charges ~ age + bmi + children, data = train)
summary(model2)
##
## Call:
## lm(formula = charges ~ age + bmi + children, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13875 -7019 -5121 7092 48730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5993.34 1965.32 -3.050 0.00235 **
## age 225.78 24.78 9.113 < 2e-16 ***
## bmi 319.37 57.88 5.518 4.31e-08 ***
## children 566.18 287.41 1.970 0.04910 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11390 on 1065 degrees of freedom
## Multiple R-squared: 0.1117, Adjusted R-squared: 0.1092
## F-statistic: 44.62 on 3 and 1065 DF, p-value: < 2.2e-16
The adjusted R-squared increases to about 0.11 and every numerical predictor has a significant effect on the model, but the model can still be improved.
Now, let’s use all predictors to predict the target variable.
model_full <- lm(charges ~ ., data = train)
summary(model_full)
##
## Call:
## lm(formula = charges ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11502.8 -2906.6 -988.5 1607.1 30767.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11485.42 1113.03 -10.319 < 2e-16 ***
## age 249.09 13.32 18.696 < 2e-16 ***
## sexmale -514.92 375.38 -1.372 0.170434
## bmi 351.59 32.45 10.836 < 2e-16 ***
## children 535.65 154.48 3.467 0.000546 ***
## smokeryes 23759.54 464.18 51.186 < 2e-16 ***
## regionnorthwest -897.36 544.24 -1.649 0.099480 .
## regionsoutheast -1816.79 539.20 -3.369 0.000780 ***
## regionsouthwest -1240.52 533.22 -2.326 0.020180 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6111 on 1060 degrees of freedom
## Multiple R-squared: 0.7455, Adjusted R-squared: 0.7436
## F-statistic: 388.2 on 8 and 1060 DF, p-value: < 2.2e-16
The adjusted R-squared improves a lot. In the new model, the predictors can explain 74.4% of the total variation in charges, much higher than the previous models. However, as we can see, some coefficients, namely sexmale and regionnorthwest, do not have a significant effect on the model.
To automatically select the best predictors for the model, we can utilize step-wise regression. This method automatically chooses the model with the lowest AIC (Akaike Information Criterion). The first method is backward elimination.
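For reference, for a model with k estimated parameters and maximized likelihood L-hat, the AIC that step() minimizes is defined as:
\[ AIC = 2k - 2\ln(\hat{L}) \]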
insurance_backward <- step(object = model_full, direction = "backward", trace = 0)
summary(insurance_backward)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11214.5 -2928.4 -948.9 1513.0 30544.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11697.92 1102.66 -10.609 < 2e-16 ***
## age 249.44 13.33 18.717 < 2e-16 ***
## bmi 350.03 32.44 10.790 < 2e-16 ***
## children 528.95 154.47 3.424 0.000640 ***
## smokeryes 23715.60 463.26 51.193 < 2e-16 ***
## regionnorthwest -901.21 544.46 -1.655 0.098175 .
## regionsoutheast -1820.42 539.42 -3.375 0.000765 ***
## regionsouthwest -1240.82 533.44 -2.326 0.020202 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6113 on 1061 degrees of freedom
## Multiple R-squared: 0.7451, Adjusted R-squared: 0.7434
## F-statistic: 443 on 7 and 1061 DF, p-value: < 2.2e-16
From the backward model, there is a slight decrease in adjusted R-squared and one predictor, sex, is eliminated.
Let's use another method, forward selection, and see if there is any difference.
# Forward
model_none <- lm(charges ~ 1, data = train)
model_forward <- step(object = model_none, direction = "forward", scope = list(lower = model_none, upper = model_full), trace = 0)
summary(model_forward)
##
## Call:
## lm(formula = charges ~ smoker + age + bmi + children + region,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11214.5 -2928.4 -948.9 1513.0 30544.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11697.92 1102.66 -10.609 < 2e-16 ***
## smokeryes 23715.60 463.26 51.193 < 2e-16 ***
## age 249.44 13.33 18.717 < 2e-16 ***
## bmi 350.03 32.44 10.790 < 2e-16 ***
## children 528.95 154.47 3.424 0.000640 ***
## regionnorthwest -901.21 544.46 -1.655 0.098175 .
## regionsoutheast -1820.42 539.42 -3.375 0.000765 ***
## regionsouthwest -1240.82 533.44 -2.326 0.020202 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6113 on 1061 degrees of freedom
## Multiple R-squared: 0.7451, Adjusted R-squared: 0.7434
## F-statistic: 443 on 7 and 1061 DF, p-value: < 2.2e-16
Looks like the result derived from this method is the same as the previous one, so let’s choose one of them as the final model.
model_final <- model_forward
model_final
##
## Call:
## lm(formula = charges ~ smoker + age + bmi + children + region,
## data = train)
##
## Coefficients:
## (Intercept) smokeryes age bmi
## -11697.9 23715.6 249.4 350.0
## children regionnorthwest regionsoutheast regionsouthwest
## 528.9 -901.2 -1820.4 -1240.8
For the final model, we have five predictors, all of which have a significant effect on the target variable. From the coefficients, we can see that smoker has the biggest coefficient. This model's adjusted R-squared is 0.7434, which means the model explains 74.34% of the total variation in charges.
Here is the final model. \[ charges = -11697.9 + 23715.6 \times smokeryes + 249.4 \times age + 350 \times bmi + 528.9 \times children - 901.2 \times regionnorthwest - 1820.4 \times regionsoutheast - 1240.8 \times regionsouthwest \]
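As an illustration of how this equation is applied (the beneficiary profile below is hypothetical and chosen only for demonstration), the predicted charge for a 40-year-old non-smoker with a BMI of 30, two children, living in the southwest is:
# manual calculation from the final equation (hypothetical beneficiary)
-11697.9 + 23715.6*0 + 249.4*40 + 350*30 + 528.9*2 - 1240.8*1   # about 8,595
# the same prediction using the fitted model
predict(model_final,
        newdata = data.frame(age = 40, bmi = 30, children = 2,
                             smoker = "no", region = "southwest"))
The two results may differ slightly because the equation above uses rounded coefficients while predict() uses the full-precision ones.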
Now, we can use the model to predict the new data, test.
test$predicted <- predict(model_forward, newdata = test)
To see how good our model is at predicting new data, we can use RMSE (Root Mean Square Error).
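For reference, RMSE is the square root of the mean squared difference between the actual values and the predicted values, with n being the number of observations:
\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]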
# predict new data using the final model
predicted_charges <- predict(model_final, newdata = test)
# RMSE Data Test
RMSE(y_pred = predicted_charges, y_true = test$charges)
## [1] 5898.908
# RMSE Data Train
RMSE(y_pred = model_final$fitted.values, y_true = train$charges)
## [1] 6090.49
The RMSE on the train data and the test data do not differ much, which means the model neither overfits nor underfits.
There are several assumptions that need to be met by a linear regression model. After creating the model, let’s check if our model meets those assumptions.
The first assumption is linearity: in a linear regression model, there has to be a linear relationship between the predictors and the target variable. We can check this with Pearson correlation tests between each numerical predictor and the target.
cor.test(insurance$bmi, insurance$charges)
##
## Pearson's product-moment correlation
##
## data: insurance$bmi and insurance$charges
## t = 7.3961, df = 1335, p-value = 2.468e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1463465 0.2493595
## sample estimates:
## cor
## 0.1984008
cor.test(insurance$age, insurance$charges)
##
## Pearson's product-moment correlation
##
## data: insurance$age and insurance$charges
## t = 11.419, df = 1335, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2486742 0.3463797
## sample estimates:
## cor
## 0.2983082
cor.test(insurance$children, insurance$charges)
##
## Pearson's product-moment correlation
##
## data: insurance$children and insurance$charges
## t = 2.4679, df = 1335, p-value = 0.01372
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01382835 0.12056473
## sample estimates:
## cor
## 0.06738935
All correlation tests return p-values below 0.05, so all numerical variables in the model have a significant linear relationship with the target variable. Next, check whether the residuals are normally distributed using the Shapiro-Wilk test.
shapiro.test(model_final$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_final$residuals
## W = 0.89846, p-value < 2.2e-16
The p-value derived from the test is lower than the alpha (0.05), which means H0 is rejected, i.e. the residuals are not normally distributed.
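A quick visual check of this result (a sketch; the plot is not part of the original output) is the histogram of the residuals, which should look roughly bell-shaped if the assumption holds:
# visual check of residual normality
hist(model_final$residuals, breaks = 30)
Next, check the homoscedasticity assumption, i.e. whether the variance of the residuals is constant, using the Breusch-Pagan test from the lmtest package.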
bptest(formula = model_final)
##
## studentized Breusch-Pagan test
##
## data: model_final
## BP = 102.56, df = 7, p-value < 2.2e-16
This model also violates the homoscedasticity assumption (the p-value is below 0.05), which means there is some pattern in the variance of the residuals.
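One way to see this pattern (again a sketch, not part of the original output) is to plot the residuals against the fitted values; under homoscedasticity, the points should scatter randomly around zero with no funnel shape:
# residuals vs fitted values to visualize heteroscedasticity
plot(model_final$fitted.values, model_final$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")
Lastly, check for multicollinearity among the predictors using the Variance Inflation Factor (VIF) from the car package.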
vif(model_forward)
## GVIF Df GVIF^(1/(2*Df))
## smoker 1.006698 1 1.003343
## age 1.023332 1 1.011599
## bmi 1.109003 1 1.053092
## children 1.005254 1 1.002624
## region 1.100307 3 1.016059
All (G)VIF values are close to 1 (well below 10), which shows that there is no multicollinearity between the predictor variables.
Variables that are good at explaining the variance in insurance charges are age, bmi, smoker, region, and children. The model shows several violations of the linear regression assumptions: the residuals are not normally distributed and show some pattern when they should not (heteroscedasticity). In order to create a better model that meets all the assumptions, we could add more observations and transform the predictors or the target variable (a transformation of the target is sketched below).
The adjusted R-squared of the model is 74.34%, which means the model with its predictors can explain 74.34% of the total variance in the insurance charges. The performance of the model in predicting the target variable is measured with RMSE: the training data has an RMSE of 6090.49 and the testing data has an RMSE of 5898.91.
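As a sketch of the transformation idea mentioned above (this model is not fitted in the original analysis, so its performance is only an assumption to be verified), the right-skewed target could be log-transformed before refitting:
# refit the selected predictors on a log-transformed target (illustrative only)
model_log <- lm(log(charges) ~ age + bmi + children + smoker + region, data = train)
# back-transform predictions to the original scale before computing RMSE
pred_log <- exp(predict(model_log, newdata = test))
RMSE(y_pred = pred_log, y_true = test$charges)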