Health Insurance illustration. (Source: Appsergu)
In this project, I will be using linear regression to analyze which factors are good predictors of health insurance charges. The dataset used in this analysis is the insurance dataset from GitHub.
Load the required libraries
library(tidyverse) # data manipulation
library(GGally) # correlation checking
library(lmtest) # Breusch-Pagan test for heteroscedasticity
library(car) # variance inflation factor (VIF)
library(see) # visualization utilities
library(MLmetrics) # RMSE and other error metrics
Load the dataset and convert the categorical variables into factors.
insurance <- read.csv("insurance.csv", stringsAsFactors = T)
rmarkdown::paged_table(insurance)
Inspect the data structure.
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
There are 6 predictor variables in this dataset that will be used in the analysis, namely:
* Age: age of the beneficiary
* Sex: sex of the beneficiary
* BMI: body mass index; a value derived from a person's weight (in kg) divided by the square of height (in m)
* Children: number of children
* Smoker: smoking status
* Region: the beneficiary's residential area
All variables already have the correct data type. Now, check if there is any duplicated data.
insurance[duplicated(insurance), ]
## age sex bmi children smoker region charges
## 582 19 male 30.59 0 no northwest 1639.563
Turns out there is one. Eliminate the duplicated data.
insurance <- insurance %>% distinct()
Check for missing data.
anyNA(insurance)
## [1] FALSE
Looks like there is no missing data in this dataset, so we can proceed with further analysis.
Now that the duplicated data is gone, check the summary.
summary(insurance)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1063
## 1st Qu.:27.00 male :675 1st Qu.:26.29 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.22 Mean :30.66 Mean :1.096
## 3rd Qu.:51.00 3rd Qu.:34.70 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:324 1st Qu.: 4746
## southeast:364 Median : 9386
## southwest:325 Mean :13279
## 3rd Qu.:16658
## Max. :63770
The youngest insurance beneficiary is 18 years old, while the oldest is 64 years old. The proportion of female and male beneficiaries is almost equal. In terms of BMI, three quarters of the insurance beneficiaries are overweight (with a BMI above 25). Half of the beneficiaries have one or more children, while the other half have none. The majority of the insurance beneficiaries do not smoke. Based on region, there is a similar number of people in each area.
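As a quick check of these proportions, here is a small sketch using the cleaned insurance data (the exact numbers are not part of the original output):
# share of beneficiaries with BMI above 25 (overweight or obese)
mean(insurance$bmi > 25)
# share of beneficiaries with at least one child
mean(insurance$children >= 1)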
This analysis aims to predict the insurance charges based on several predictors using linear regression, so the target variable is charges. Let's inspect the distribution of the target variable.
hist(insurance$charges)
The distribution of the insurance charges is right-skewed, with only a small number of beneficiaries charged more than $50,000.
Now, let’s see the distribution of insurance charges based on the predictors.
# Sex
ggplot(insurance, aes(sex, charges)) +
geom_boxplot(aes(fill = sex))
From the plot, it looks like there is no significant difference in insurance charges between male and female beneficiaries.
# Region
ggplot(insurance, aes(region, charges)) +
geom_boxplot(aes(fill = region))
Based on region, there is also no significant difference in charges across the categories.
# Smoking status
ggplot(insurance, aes(smoker, charges)) +
geom_boxplot(aes(fill = smoker))
In terms of smoking status, people who smoke tend to be charged more than people who do not smoke.
# BMI
ggplot(insurance, aes(bmi, charges)) +
geom_point() + geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
There is an upward trend of insurance charges based on BMI: the higher the BMI, the higher the beneficiary tends to be charged.
# Based on age
ggplot(insurance, aes(age, charges)) +
geom_point() +geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
There is an upward trend of charges based on age as well.
# Based on number of children
ggplot(insurance, aes(children, charges)) +
geom_point()
Based on the number of children, it seems that people with 5 children tend to be charged less.
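To make this comparison concrete, here is a short sketch that computes the mean charge per number of children (the averages themselves are not shown in the original output):
# average charge and group size for each number of children
insurance %>%
  group_by(children) %>%
  summarise(mean_charges = mean(charges), n = n())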
Now, look at the correlation between numerical variables and the target variable.
ggcorr(insurance, label = T)
## Warning in ggcorr(insurance, label = T): data in column(s) 'sex', 'smoker',
## 'region' are not numeric and were ignored
Among the numerical predictors in the data, age has the highest correlation with the target variable (0.3).
Before creating the model, split the dataset into train and test data.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(110)
index <- sample(nrow(insurance), nrow(insurance)*0.8)
train <- insurance[index, ]
test <- insurance[-index, ]
Create the first model using the predictor variable with the highest correlation with the target variable, age.
model1 <- lm(charges ~ age, data = train)
summary(model1)
##
## Call:
## lm(formula = charges ~ age, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8102 -6706 -5875 5353 47881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3669.22 1033.71 3.55 0.000403 ***
## age 245.38 24.91 9.85 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11560 on 1067 degrees of freedom
## Multiple R-squared: 0.08336, Adjusted R-squared: 0.0825
## F-statistic: 97.03 on 1 and 1067 DF, p-value: < 2.2e-16
From the summary, we can see the intercept of the model, i.e. the predicted insurance charge for a 0-year-old beneficiary, is $3,669.22. The age variable has a significant effect on the target variable, and for every 1-year increase in the beneficiary's age, the insurance charge increases by $245.38. The adjusted R-squared of the model is 0.0825, which means this model can only explain about 8.3% of the total variation in charges.
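As a worked example of how these coefficients are used (the age below is chosen purely for illustration), the predicted charge for a hypothetical 30-year-old beneficiary is:
# manual prediction: intercept + slope * age
3669.22 + 245.38 * 30   # about 11,031
# equivalent prediction using the fitted model
predict(model1, newdata = data.frame(age = 30))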
Next, create a model using all numeric predictors.
model2 <- lm(charges ~ age + bmi + children, data = train)
summary(model2)
##
## Call:
## lm(formula = charges ~ age + bmi + children, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13875 -7019 -5121 7092 48730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5993.34 1965.32 -3.050 0.00235 **
## age 225.78 24.78 9.113 < 2e-16 ***
## bmi 319.37 57.88 5.518 4.31e-08 ***
## children 566.18 287.41 1.970 0.04910 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11390 on 1065 degrees of freedom
## Multiple R-squared: 0.1117, Adjusted R-squared: 0.1092
## F-statistic: 44.62 on 3 and 1065 DF, p-value: < 2.2e-16
The adjusted R-squared increases to about 0.11 and every numerical predictor has a significant effect on the model, but the model can still be improved.
Now, let’s use all predictors to predict the target variable.
model_full <- lm(charges ~ ., data = train)
summary(model_full)
##
## Call:
## lm(formula = charges ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11502.8 -2906.6 -988.5 1607.1 30767.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11485.42 1113.03 -10.319 < 2e-16 ***
## age 249.09 13.32 18.696 < 2e-16 ***
## sexmale -514.92 375.38 -1.372 0.170434
## bmi 351.59 32.45 10.836 < 2e-16 ***
## children 535.65 154.48 3.467 0.000546 ***
## smokeryes 23759.54 464.18 51.186 < 2e-16 ***
## regionnorthwest -897.36 544.24 -1.649 0.099480 .
## regionsoutheast -1816.79 539.20 -3.369 0.000780 ***
## regionsouthwest -1240.52 533.22 -2.326 0.020180 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6111 on 1060 degrees of freedom
## Multiple R-squared: 0.7455, Adjusted R-squared: 0.7436
## F-statistic: 388.2 on 8 and 1060 DF, p-value: < 2.2e-16
The adjusted R-squared improves a lot. In the new model, the predictors can explain 74.4% of the total variation in charges, much higher than the previous models. However, as we can see, some coefficients, namely sexmale and regionnorthwest, do not have a significant effect on the model.
To automatically select the best predictors for the model, we can utilize step-wise regression. This method automatically chooses the model with the lowest AIC (Akaike Information Criterion). The first method is backward elimination.
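For reference, for a model with k estimated parameters and maximized likelihood L-hat, the AIC that step() minimizes is defined as:
\[ AIC = 2k - 2\ln(\hat{L}) \]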
insurance_backward <- step(object = model_full, direction = "backward", trace = 0)
summary(insurance_backward)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11214.5 -2928.4 -948.9 1513.0 30544.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11697.92 1102.66 -10.609 < 2e-16 ***
## age 249.44 13.33 18.717 < 2e-16 ***
## bmi 350.03 32.44 10.790 < 2e-16 ***
## children 528.95 154.47 3.424 0.000640 ***
## smokeryes 23715.60 463.26 51.193 < 2e-16 ***
## regionnorthwest -901.21 544.46 -1.655 0.098175 .
## regionsoutheast -1820.42 539.42 -3.375 0.000765 ***
## regionsouthwest -1240.82 533.44 -2.326 0.020202 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6113 on 1061 degrees of freedom
## Multiple R-squared: 0.7451, Adjusted R-squared: 0.7434
## F-statistic: 443 on 7 and 1061 DF, p-value: < 2.2e-16
From the backward model, there is a slight decrease in adjusted R-squared and one predictor, sex, is eliminated.
Let's use another method, forward selection, and see if there is any difference.
# Forward
model_none <- lm(charges ~ 1, data = train)
model_forward <- step(object = model_none, direction = "forward", scope = list(lower = model_none, upper = model_full), trace = 0)
summary(model_forward)
##
## Call:
## lm(formula = charges ~ smoker + age + bmi + children + region,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11214.5 -2928.4 -948.9 1513.0 30544.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11697.92 1102.66 -10.609 < 2e-16 ***
## smokeryes 23715.60 463.26 51.193 < 2e-16 ***
## age 249.44 13.33 18.717 < 2e-16 ***
## bmi 350.03 32.44 10.790 < 2e-16 ***
## children 528.95 154.47 3.424 0.000640 ***
## regionnorthwest -901.21 544.46 -1.655 0.098175 .
## regionsoutheast -1820.42 539.42 -3.375 0.000765 ***
## regionsouthwest -1240.82 533.44 -2.326 0.020202 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6113 on 1061 degrees of freedom
## Multiple R-squared: 0.7451, Adjusted R-squared: 0.7434
## F-statistic: 443 on 7 and 1061 DF, p-value: < 2.2e-16
Looks like the result derived from this method is the same as the previous one, so let’s choose one of them as the final model.
model_final <- model_forward
model_final
##
## Call:
## lm(formula = charges ~ smoker + age + bmi + children + region,
## data = train)
##
## Coefficients:
## (Intercept) smokeryes age bmi
## -11697.9 23715.6 249.4 350.0
## children regionnorthwest regionsoutheast regionsouthwest
## 528.9 -901.2 -1820.4 -1240.8
For the final model, we have five predictors, all of which have a significant effect on the target variable. From the coefficients, we can see that smoker has the biggest coefficient. This model's adjusted R-squared is 0.7434, which means the model explains 74.34% of the total variation in charges.
Here is the final model. \[ charges = -11697.9 + 23715.6 \times smokeryes + 249.4 \times age + 350 \times bmi + 528.9 \times children - 901.2 \times regionnorthwest - 1820.4 \times regionsoutheast - 1240.8 \times regionsouthwest \]
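As an illustration of how this equation is applied (the beneficiary profile below is hypothetical and chosen only for demonstration), the predicted charge for a 40-year-old non-smoker with a BMI of 30, two children, living in the southwest is:
# manual calculation from the final equation (hypothetical beneficiary)
-11697.9 + 23715.6*0 + 249.4*40 + 350*30 + 528.9*2 - 1240.8*1   # about 8,595
# the same prediction using the fitted model
predict(model_final,
        newdata = data.frame(age = 40, bmi = 30, children = 2,
                             smoker = "no", region = "southwest"))
The two results may differ slightly because the equation above uses rounded coefficients while predict() uses the full-precision ones.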
Now, we can use the model to predict the new data, test.
test$predicted <- predict(model_forward, newdata = test)
To see how good our model is at predicting new data, we can use RMSE (Root Mean Square Error).
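For reference, RMSE is the square root of the mean squared difference between the actual values and the predicted values, with n being the number of observations:
\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]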
# predict new data using the final model
predicted_charges <- predict(model_final, newdata = test)
# RMSE Data Test
RMSE(y_pred = predicted_charges, y_true = test$charges)
## [1] 5898.908
# RMSE Data Train
RMSE(y_pred = model_final$fitted.values, y_true = train$charges)
## [1] 6090.49
The RMSE on the train data and the test data do not differ much, which means the model neither overfits nor underfits.
There are several assumptions that need to be met by a linear regression model. After creating the model, let’s check if our model meets those assumptions.
The first assumption is linearity: in a linear regression model, there has to be a linear relationship between the predictors and the target variable. We can check this with Pearson correlation tests between each numerical predictor and the target.
cor.test(insurance$bmi, insurance$charges)
##
## Pearson's product-moment correlation
##
## data: insurance$bmi and insurance$charges
## t = 7.3961, df = 1335, p-value = 2.468e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1463465 0.2493595
## sample estimates:
## cor
## 0.1984008
cor.test(insurance$age, insurance$charges)
##
## Pearson's product-moment correlation
##
## data: insurance$age and insurance$charges
## t = 11.419, df = 1335, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2486742 0.3463797
## sample estimates:
## cor
## 0.2983082
cor.test(insurance$children, insurance$charges)
##
## Pearson's product-moment correlation
##
## data: insurance$children and insurance$charges
## t = 2.4679, df = 1335, p-value = 0.01372
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01382835 0.12056473
## sample estimates:
## cor
## 0.06738935
All correlation tests return p-values below 0.05, so all numerical variables in the model have a significant linear relationship with the target variable. Next, check whether the residuals are normally distributed using the Shapiro-Wilk test.
shapiro.test(model_final$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_final$residuals
## W = 0.89846, p-value < 2.2e-16
The p-value derived from the test is lower than the alpha (0.05), which means H0 is rejected, i.e. the residuals are not normally distributed.
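A quick visual check of this result (a sketch; the plot is not part of the original output) is the histogram of the residuals, which should look roughly bell-shaped if the assumption holds:
# visual check of residual normality
hist(model_final$residuals, breaks = 30)
Next, check the homoscedasticity assumption, i.e. whether the variance of the residuals is constant, using the Breusch-Pagan test from the lmtest package.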
bptest(formula = model_final)
##
## studentized Breusch-Pagan test
##
## data: model_final
## BP = 102.56, df = 7, p-value < 2.2e-16
This model also violates the homoscedasticity assumption (the p-value is below 0.05), which means there is some pattern in the variance of the residuals.
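One way to see this pattern (again a sketch, not part of the original output) is to plot the residuals against the fitted values; under homoscedasticity, the points should scatter randomly around zero with no funnel shape:
# residuals vs fitted values to visualize heteroscedasticity
plot(model_final$fitted.values, model_final$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")
Lastly, check for multicollinearity among the predictors using the Variance Inflation Factor (VIF) from the car package.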
vif(model_forward)
## GVIF Df GVIF^(1/(2*Df))
## smoker 1.006698 1 1.003343
## age 1.023332 1 1.011599
## bmi 1.109003 1 1.053092
## children 1.005254 1 1.002624
## region 1.100307 3 1.016059
All (G)VIF values are close to 1 (well below 10), which shows that there is no multicollinearity between the predictor variables.
Variables that are good at explaining the variance in insurance charges are age, bmi, smoker, region, and children. The model shows several violations of the linear regression assumptions: the residuals are not normally distributed and show some pattern when they should not (heteroscedasticity). In order to create a better model that meets all the assumptions, we could add more observations and transform the predictors or the target variable (a transformation of the target is sketched below).
The adjusted R-squared of the model is 74.34%, which means the model with its predictors can explain 74.34% of the total variance in the insurance charges. The performance of the model in predicting the target variable is measured with RMSE: the training data has an RMSE of 6090.49 and the testing data has an RMSE of 5898.91.
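As a sketch of the transformation idea mentioned above (this model is not fitted in the original analysis, so its performance is only an assumption to be verified), the right-skewed target could be log-transformed before refitting:
# refit the selected predictors on a log-transformed target (illustrative only)
model_log <- lm(log(charges) ~ age + bmi + children + smoker + region, data = train)
# back-transform predictions to the original scale before computing RMSE
pred_log <- exp(predict(model_log, newdata = test))
RMSE(y_pred = pred_log, y_true = test$charges)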