An Implementation of Linear Regression in the Insurance Industry


1. Preface

1.1 Background

It is natural that we all will do whatever it takes to protect ourselves, our family, and our loved ones, especially in matters related to health. However, it is not uncommon that we also want to do it at a cost we can afford - one that fits within our financial budget.

A recent study from the industry groups LIMRA and Life Happens found that 102 million uninsured and underinsured Americans (about 40% of the country’s adult population) believe that they need to purchase medical insurance or invest in more coverage. However, more than one in three put off buying coverage because they think it costs more than it actually does. That can put their families at risk if they get into an accident unexpectedly.

Many people believe medical insurance is out of reach financially, often because they don’t know how medical insurance factors impact pricing. Yet medical insurance is one of the most affordable parts of a strong financial plan. Since cost misunderstanding is a major reason why people don’t buy, let’s consider what really goes into the premium price of medical insurance.

In this project, we will use linear regression to investigate how the predictors influence the medical insurance premium price. Afterwards, you can also use the final equation/model to estimate your own insurance premium by plugging in a few conditions. So, without further ado, check this out~ 😉

⚠️ Disclaimer:

Predictions generated by the final equation/model are estimates only. In practice, there are other factors that can affect the accuracy of the predictions (such as year, exchange rate, inflation, insurance company rules, etc.).

My opinion: just focus on how each variable affects the premium price (whether the predictor has a positive effect, increasing the premium price, or a negative effect, lowering it).

1.2 About the Dataset

This project uses a dataset taken from Kaggle. It contains information on 986 clients of an insurance company, including their yearly premium prices.

The columns in the dataset consist of:

  • Age : the age of the client (in years)
  • Diabetes : whether the client has abnormal blood sugar levels (0 = No, 1 = Yes)
  • BloodPressureProblems : whether the client has abnormal blood pressure levels (0 = No, 1 = Yes)
  • AnyTransplants : whether the client has any major organ transplants (0 = No, 1 = Yes)
  • AnyChronicDiseases : whether the client suffers from chronic ailments like asthma, etc. (0 = No, 1 = Yes)
  • Height : the height of the client (in centimeters)
  • Weight : the weight of the client (in kilograms)
  • KnownAllergies : whether the client has any known allergies (0 = No, 1 = Yes)
  • HistoryOfCancerInFamily : whether any blood relative of the client has had any form of cancer (0 = No, 1 = Yes)
  • NumberOfMajorSurgeries : the number of major surgeries that the client has had
  • PremiumPrice : yearly medical insurance premium price (in Indian Rupees, INR ₹)

2. Dependencies

Load all the libraries needed to work on this project:

# load libraries
library(dplyr) # for data transformation
library(DT) # for interactive tables in R Markdown
library(ggplot2) # for data visualization
library(GGally) # for EDA and correlation checks
library(performance) # for model comparison and assumption checks
library(MLmetrics) # for calculating error metrics
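
If some of these packages are not installed yet, they can be installed from CRAN first (a one-time setup):

# install the required packages (run once)
install.packages(c("dplyr", "DT", "ggplot2", "GGally", "performance", "MLmetrics"))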

3. Read Data

# read data
insurance <- read.csv("data_input/insurance.csv", header=T, na.strings=c(""))
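
Since the DT package is loaded, we can render the data as an interactive table; a minimal sketch (the exact display options are up to you):

# preview the data as an interactive table
datatable(insurance, options = list(pageLength = 5))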

Based on the table above, we can see that our dataset consists of 986 rows and 11 variables.

4. Data Cleansing

First of all, we have to check the structure of our dataset - whether the data types are correct or not.

# check data structure
str(insurance)
## 'data.frame':    986 obs. of  11 variables:
##  $ Age                    : int  45 60 36 52 38 30 33 23 48 38 ...
##  $ Diabetes               : int  0 1 1 1 0 0 0 0 1 0 ...
##  $ BloodPressureProblems  : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ AnyTransplants         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AnyChronicDiseases     : int  0 0 0 1 1 0 0 0 0 0 ...
##  $ Height                 : int  155 180 158 183 166 160 150 181 169 182 ...
##  $ Weight                 : int  57 73 59 93 88 69 54 79 74 93 ...
##  $ KnownAllergies         : int  0 0 0 0 0 1 0 1 1 0 ...
##  $ HistoryOfCancerInFamily: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NumberOfMajorSurgeries : int  0 0 1 2 1 1 0 0 0 0 ...
##  $ PremiumPrice           : int  25000 29000 23000 28000 23000 23000 21000 15000 23000 23000 ...

There seem to be some mismatched data types: the binary indicator columns are stored as integers when they should be factors. So, before we proceed to the next step, let's convert them.

# change data type
insurance[,c("Diabetes","BloodPressureProblems","AnyTransplants","AnyChronicDiseases",
  "KnownAllergies","HistoryOfCancerInFamily")] <-
  lapply(insurance[,c("Diabetes","BloodPressureProblems",
  "AnyTransplants","AnyChronicDiseases","KnownAllergies",
  "HistoryOfCancerInFamily")], as.factor)

We can also add a new column containing BMI (Body Mass Index) - a medical screening tool that divides weight by squared height to estimate the amount of body fat.
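
In formula form, with weight in kilograms and height in meters:

\[BMI = \frac{weight}{height^2}\]

This is why the code below divides Height (recorded in centimeters) by 100 before squaring.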

# make BMI column
insurance <- insurance %>% 
  mutate(BMI = Weight/((Height/100)^2))

Check the data structure one last time before we move on to the next step.

# take a glimpse of the data
glimpse(insurance)
## Rows: 986
## Columns: 12
## $ Age                     <int> 45, 60, 36, 52, 38, 30, 33, 23, 48, 38, 60, 66~
## $ Diabetes                <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0~
## $ BloodPressureProblems   <fct> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0~
## $ AnyTransplants          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0~
## $ AnyChronicDiseases      <fct> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Height                  <int> 155, 180, 158, 183, 166, 160, 150, 181, 169, 1~
## $ Weight                  <int> 57, 73, 59, 93, 88, 69, 54, 79, 74, 93, 74, 67~
## $ KnownAllergies          <fct> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1~
## $ HistoryOfCancerInFamily <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ NumberOfMajorSurgeries  <int> 0, 0, 1, 2, 1, 1, 0, 0, 0, 0, 2, 0, 1, 0, 1, 1~
## $ PremiumPrice            <int> 25000, 29000, 23000, 28000, 23000, 23000, 2100~
## $ BMI                     <dbl> 23.72529, 22.53086, 23.63403, 27.77031, 31.934~

Also check if there are any missing values.

# check missing values
colSums(is.na(insurance))
##                     Age                Diabetes   BloodPressureProblems 
##                       0                       0                       0 
##          AnyTransplants      AnyChronicDiseases                  Height 
##                       0                       0                       0 
##                  Weight          KnownAllergies HistoryOfCancerInFamily 
##                       0                       0                       0 
##  NumberOfMajorSurgeries            PremiumPrice                     BMI 
##                       0                       0                       0

It looks like the data types are now correct. Furthermore, there are no missing values in our dataset, so we can proceed to the next phase.

5. Target Variable and Predictor Variable

  • Target variable: PremiumPrice
  • Predictor variables: Age, Diabetes, BloodPressureProblems, AnyTransplants, AnyChronicDiseases, Height, Weight, KnownAllergies, HistoryOfCancerInFamily, NumberOfMajorSurgeries, BMI

6. Feature Selection

We know that BMI is calculated from Height and Weight. So, we can remove Height and Weight from the predictor variables, as their information is already captured by the BMI column (keeping all three would introduce redundant, strongly correlated predictors).

# remove Height and Weight columns
insurance_clean <- insurance %>% 
  select(-c("Height","Weight"))

7. Exploratory Data Analysis (EDA)

Check whether our dataset has outliers.

# check outliers
ggplot(data = insurance_clean, mapping = aes(x = "", y = PremiumPrice)) +
  geom_boxplot(fill = "#C2981D", color = "#415C23") +
  xlab("") +
  ylab("Medical Insurance Premium Price")

It can be seen that our dataset has some outliers (although not many). Let's look at the values of these outliers.

# find out the value of the outliers
boxplot.stats(insurance_clean$PremiumPrice)$out
## [1] 39000 40000 39000 39000 39000 39000

So, we can conclude that premium prices of 39000 and above are flagged as outliers.
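
For reference, boxplot.stats() flags points lying beyond 1.5 × IQR from the hinges (approximately the quartiles); the upper threshold can be computed directly with base R:

# approximate upper outlier threshold: Q3 + 1.5 * IQR
q3 <- quantile(insurance_clean$PremiumPrice, 0.75)
unname(q3 + 1.5 * IQR(insurance_clean$PremiumPrice))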

But for now, let’s just ignore these outliers, and we’ll talk about them later when tuning the model.

After finding out about outliers, we continue by checking the correlation between variables.

# check correlation using heatmap
ggcorr(insurance_clean, hjust = 1, layout.exp = 1, label = TRUE)

# check correlation using correlation test
# a starred value means that there is a significant correlation
ggduo(insurance_clean, 
      c("Age","NumberOfMajorSurgeries","BMI"), "PremiumPrice", 
      types = list(continuous = "cor"))

# check correlation between categoric variable and target variable
ggduo(insurance_clean, 
      c("Diabetes","BloodPressureProblems","AnyTransplants"), 
      "PremiumPrice")

# check correlation between categoric variable and target variable
ggduo(insurance_clean, 
      c("AnyChronicDiseases","KnownAllergies","HistoryOfCancerInFamily"), 
      "PremiumPrice")


From the correlation check that has been done, we must pay attention to a few things, such as:

  • Correlation between the numeric predictors and the target variable: the correlations of Age, NumberOfMajorSurgeries, and BMI with PremiumPrice are statistically significant
  • There is a correlation between Age and NumberOfMajorSurgeries, but we can set it aside for now and revisit it during the multicollinearity check
  • Correlation between the categorical predictors and the target variable (from the boxplots): for some variables the distributions are not very different across levels (categorical predictors such as Diabetes, KnownAllergies, and BloodPressureProblems seem to have little correlation with, and little effect on, the target), but we can leave them in for now and let the stepwise procedures decide later

8. Train-Test Splitting

Divide our dataset into a train set and a test set with an 80:20 ratio.

# train-test splitting
RNGkind(sample.kind = "Rounding")
set.seed(100)

insample <- sample(nrow(insurance_clean), nrow(insurance_clean)*0.8)
insurance_train <- insurance_clean[insample,]
insurance_test <- insurance_clean[-insample,]
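
As a quick sanity check, the training set should contain 788 rows and the test set 198 rows (986 × 0.8, truncated):

# sanity check: sizes of the split
nrow(insurance_train) # expect 788
nrow(insurance_test) # expect 198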

9. Model Fitting

In this section, we will use the lm() function to create five models for comparison:

  1. model_none: model without predictors
  2. model_all: model with all predictors
  3. model_backward: the model generated from backward stepwise regression process
  4. model_forward: the model generated from forward stepwise regression process
  5. model_both: the model generated from forward-backward stepwise regression process
9.1 Without Predictor

Create a model without using any predictor. In other words, the model contains only an intercept.

# model without any predictor
model_none <- lm(PremiumPrice ~ 1, data = insurance_train)
summary(model_none)
## 
## Call:
## lm(formula = PremiumPrice ~ 1, data = insurance_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9402.3 -1652.3  -902.3  4597.7 15597.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    24402        223   109.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6260 on 787 degrees of freedom
9.2 All Predictors

Create a model with all predictors.

# model with all predictors
model_all <- lm(PremiumPrice ~ ., data = insurance_train)
summary(model_all)
## 
## Call:
## lm(formula = PremiumPrice ~ ., data = insurance_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13713.9  -2151.4   -296.5   1806.8  21416.6 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               5510.43     768.73   7.168 1.77e-12 ***
## Age                        337.90      10.59  31.916  < 2e-16 ***
## Diabetes1                 -568.73     274.92  -2.069   0.0389 *  
## BloodPressureProblems1     345.69     274.65   1.259   0.2085    
## AnyTransplants1           8329.90     566.53  14.703  < 2e-16 ***
## AnyChronicDiseases1       2689.03     352.25   7.634 6.68e-14 ***
## KnownAllergies1            442.26     320.36   1.381   0.1678    
## HistoryOfCancerInFamily1  2514.17     413.16   6.085 1.82e-09 ***
## NumberOfMajorSurgeries    -891.68     205.57  -4.338 1.63e-05 ***
## BMI                        150.36      22.47   6.691 4.24e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3672 on 778 degrees of freedom
## Multiple R-squared:  0.6599, Adjusted R-squared:  0.656 
## F-statistic: 167.7 on 9 and 778 DF,  p-value: < 2.2e-16
9.3 Backward Stepwise Regression

Create a model using the backward stepwise regression process.

# model with backward stepwise regression process 
model_backward <- step(object = model_all, direction = "backward", trace = F)
summary(model_backward)
## 
## Call:
## lm(formula = PremiumPrice ~ Age + Diabetes + AnyTransplants + 
##     AnyChronicDiseases + HistoryOfCancerInFamily + NumberOfMajorSurgeries + 
##     BMI, data = insurance_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13551.7  -2173.0   -361.1   1835.8  21568.3 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               5640.40     766.24   7.361 4.64e-13 ***
## Age                        338.96      10.50  32.286  < 2e-16 ***
## Diabetes1                 -577.72     272.65  -2.119   0.0344 *  
## AnyTransplants1           8305.87     566.58  14.660  < 2e-16 ***
## AnyChronicDiseases1       2682.18     352.27   7.614 7.69e-14 ***
## HistoryOfCancerInFamily1  2567.94     412.30   6.228 7.70e-10 ***
## NumberOfMajorSurgeries    -816.13     201.58  -4.049 5.66e-05 ***
## BMI                        151.72      22.45   6.757 2.75e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3675 on 780 degrees of freedom
## Multiple R-squared:  0.6584, Adjusted R-squared:  0.6553 
## F-statistic: 214.8 on 7 and 780 DF,  p-value: < 2.2e-16
9.4 Forward Stepwise Regression

Create a model using the forward stepwise regression process.

# model with forward stepwise regression process 
model_forward <- step(object = model_none, 
                      direction = "forward", 
                      scope = list(upper = model_all), 
                      trace = F)
summary(model_forward)
## 
## Call:
## lm(formula = PremiumPrice ~ Age + AnyTransplants + AnyChronicDiseases + 
##     BMI + HistoryOfCancerInFamily + NumberOfMajorSurgeries + 
##     Diabetes, data = insurance_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13551.7  -2173.0   -361.1   1835.8  21568.3 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               5640.40     766.24   7.361 4.64e-13 ***
## Age                        338.96      10.50  32.286  < 2e-16 ***
## AnyTransplants1           8305.87     566.58  14.660  < 2e-16 ***
## AnyChronicDiseases1       2682.18     352.27   7.614 7.69e-14 ***
## BMI                        151.72      22.45   6.757 2.75e-11 ***
## HistoryOfCancerInFamily1  2567.94     412.30   6.228 7.70e-10 ***
## NumberOfMajorSurgeries    -816.13     201.58  -4.049 5.66e-05 ***
## Diabetes1                 -577.72     272.65  -2.119   0.0344 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3675 on 780 degrees of freedom
## Multiple R-squared:  0.6584, Adjusted R-squared:  0.6553 
## F-statistic: 214.8 on 7 and 780 DF,  p-value: < 2.2e-16
9.5 Forward-Backward Stepwise Regression

Create a model using the forward-backward (both-direction) stepwise regression process.

# model with forward-backward stepwise regression process 
model_both <- step(object = model_none, 
                   direction = "both", 
                   scope = list(upper = model_all), 
                   trace = F)
summary(model_both)
## 
## Call:
## lm(formula = PremiumPrice ~ Age + AnyTransplants + AnyChronicDiseases + 
##     BMI + HistoryOfCancerInFamily + NumberOfMajorSurgeries + 
##     Diabetes, data = insurance_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13551.7  -2173.0   -361.1   1835.8  21568.3 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               5640.40     766.24   7.361 4.64e-13 ***
## Age                        338.96      10.50  32.286  < 2e-16 ***
## AnyTransplants1           8305.87     566.58  14.660  < 2e-16 ***
## AnyChronicDiseases1       2682.18     352.27   7.614 7.69e-14 ***
## BMI                        151.72      22.45   6.757 2.75e-11 ***
## HistoryOfCancerInFamily1  2567.94     412.30   6.228 7.70e-10 ***
## NumberOfMajorSurgeries    -816.13     201.58  -4.049 5.66e-05 ***
## Diabetes1                 -577.72     272.65  -2.119   0.0344 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3675 on 780 degrees of freedom
## Multiple R-squared:  0.6584, Adjusted R-squared:  0.6553 
## F-statistic: 214.8 on 7 and 780 DF,  p-value: < 2.2e-16

10. Model Performance on Train Set

Compare the performance of the five models generated previously. Look at their AIC, R-squared/Adjusted R-squared, and RMSE values.

# model performance comparison (based on train set)
model_comparison1 <- compare_performance(model_none, 
                                         model_all, 
                                         model_backward, 
                                         model_forward, 
                                         model_both)
as.data.frame(model_comparison1) %>% 
  select("Name","Model","AIC","R2","R2_adjusted","RMSE")

11. Prediction on Unseen Data

Use all five models to predict PremiumPrice, supplying the predictor values from the test set.

11.1 Based on model_none
# predict based on model_none
predict_model_none <- predict(object = model_none, newdata = insurance_test)
11.2 Based on model_all
# predict based on model_all
predict_model_all <- predict(object = model_all, newdata = insurance_test)
11.3 Based on model_backward
# predict based on model_backward
predict_model_backward <- predict(object = model_backward, newdata = insurance_test)
11.4 Based on model_forward
# predict based on model_forward
predict_model_forward <- predict(object = model_forward, newdata = insurance_test)
11.5 Based on model_both
# predict based on model_both
predict_model_both <- predict(object = model_both, newdata = insurance_test)

12. Model Performance on Test Set

Let’s check the RMSE values of the models on the testing data.
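
As a reminder, RMSE measures the typical prediction error in the same units as the target (here, INR):

\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}\]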

# model performance comparison (based on test set)
Name <- c("model_none", "model_all", "model_backward", "model_forward", "model_both")
rmse_model_none <- RMSE(y_pred = predict_model_none, y_true = insurance_test$PremiumPrice)
rmse_model_all <- RMSE(y_pred = predict_model_all, y_true = insurance_test$PremiumPrice)
rmse_model_backward <- RMSE(y_pred = predict_model_backward, y_true = insurance_test$PremiumPrice)
rmse_model_forward <- RMSE(y_pred = predict_model_forward, y_true = insurance_test$PremiumPrice)
rmse_model_both <- RMSE(y_pred = predict_model_both, y_true = insurance_test$PremiumPrice)
RMSE <- c(rmse_model_none, rmse_model_all, rmse_model_backward, rmse_model_forward, rmse_model_both)
data.frame(Name, RMSE)
# data range of target variable (as a reference only)
range(insurance_clean$PremiumPrice)
## [1] 15000 40000

# standard deviation of target variable (as a reference only)
sd(insurance_clean$PremiumPrice)
## [1] 6248.184

13. Model Selection

Based on the results above, the following insights can be obtained:

💭 Best Model: model_backward

❓ Why:

The reasons are:

  • Based on model performance on the training set, model_backward, model_forward, and model_both have identical values in all metrics. They have the smallest AIC, although their Adjusted R-squared values are slightly smaller and their RMSEs slightly larger than those of model_all. They also end up with the same predictors, intercept, and coefficients.
  • Based on model performance on the testing set, model_backward, model_forward, and model_both also share the same RMSE, which is the smallest among the five models.
  • Ultimately, we care more about a model's ability to predict new, unseen data than about the historical data used in training. So we prioritize performance on the testing set, where model_backward, model_forward, and model_both have the smallest RMSE; their slightly smaller Adjusted R-squared on the training set is not a problem. Because these three models are identical, we can choose any of them. In this case, I choose model_backward.
  • In model_backward, all predictors were retained except KnownAllergies and BloodPressureProblems. This matches our concern from the EDA: the PremiumPrice distribution did not look very different across the levels of those two variables, suggesting they have little effect on the target.
  • The training-set RMSEs are mostly smaller than the testing-set RMSEs, which suggests the models generalize somewhat worse to new data. The RMSEs are also sizeable relative to the spread of the target (standard deviation around 6248), so there is still room for improvement.

But before we tune the model_backward, let’s check for the assumptions first.

14. Model Assumptions

14.1 Linearity

Using a correlation hypothesis test (to assess linearity):

  • H0: there is no significant correlation
  • H1: there is significant correlation

Conclusion: We have already tested the correlation between each numeric predictor variable and the target variable; all the predictor variables used in model_backward are significantly correlated with the target variable. Assumption fulfilled.
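
For reference, each correlation can also be tested directly; a minimal sketch for Age (the other numeric predictors work the same way):

# correlation test between a numeric predictor and the target
cor.test(insurance_train$Age, insurance_train$PremiumPrice)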

14.2 Normality of Errors

Using Shapiro-Wilk hypothesis test:

  • H0: errors are normally-distributed
  • H1: errors are not normally-distributed
# check normality
check_normality(model_backward)
## Warning: Non-normality of residuals detected (p < .001).

Conclusion: p-value < 0.05, then reject H0. Errors are not normally distributed. Assumption is not fulfilled.
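
For reference, the same Shapiro-Wilk test can be run directly on the residuals with base R (check_normality() is based on the Shapiro-Wilk test as well, so the conclusion should match):

# Shapiro-Wilk test on the model residuals
shapiro.test(residuals(model_backward))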

14.3 Homoscedasticity of Errors

Using Breusch-Pagan hypothesis test:

  • H0: homoscedasticity is present (the errors are distributed with equal variance)
  • H1: heteroscedasticity is present (the errors are not distributed with equal variance)
# check heteroscedasticity
check_heteroscedasticity(model_backward)
## OK: Error variance appears to be homoscedastic (p = 0.101).

Conclusion: p-value > 0.05, then there is not enough evidence to reject H0. Errors are distributed with equal variance (homoscedasticity is present). Assumption fulfilled.
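
For reference, the Breusch-Pagan test is also available as bptest() in the lmtest package (an extra dependency, not loaded above); a sketch assuming lmtest is installed:

# Breusch-Pagan test via the lmtest package
library(lmtest)
bptest(model_backward)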

14.4 No Multicollinearity

Check VIF (Variance Inflation Factor) value for each predictor variable with the following conditions:

  • VIF value > 10: multicollinearity occurs in the model
  • VIF value < 10: there is no multicollinearity in the model
# check multicollinearity
data.frame(check_collinearity(model_backward))

Conclusion: all VIF values < 10, then there is no multicollinearity in the model. Assumption fulfilled.

14.5 Recap

Based on the assumption tests that have been done, which assumptions have been met?

  • Linearity (the correlation between the target variable and each predictor must be linear): fulfilled
  • Normality of errors: not fulfilled
  • Homoscedasticity (errors are distributed with equal variance): fulfilled
  • No multicollinearity in the model: fulfilled

15. Additional: Model Tuning

Model tuning is the process of improving performance in various ways, such as feature engineering, removing outliers, transforming variables, etc. The goal is to minimize the error or to fulfill certain assumptions.

15.1 Target Transformation

Target transformation is done to bring the target's distribution closer to normal and to reduce the influence of extreme values; in practice it sometimes improves accuracy as well. This time, we will use the square-root transformation.

# target transformation
insurance_train_tune <- insurance_train %>% 
  mutate(sqrt_PremiumPrice = sqrt(PremiumPrice)) %>% 
  select(-PremiumPrice)

See below for evidence that the square-root transformation pulls the outliers back in.

par(mar = c(2, 3, 2, 2))

# before transformation
boxplot(insurance_train$PremiumPrice,
        main = "Distribution of Premium Price (Before Transformation)")

# after transformation
boxplot(insurance_train_tune$sqrt_PremiumPrice,
        main = "Distribution of Premium Price (After Transformation)")

15.2 Re-modeling

Create five new models using the transformed data. The five models have the following details:

  1. model_none_tune: model without predictors
  2. model_all_tune: model with all predictors
  3. model_backward_tune: the model generated from backward stepwise regression process
  4. model_forward_tune: the model generated from forward stepwise regression process
  5. model_both_tune: the model generated from forward-backward stepwise regression process
# re-modeling
model_none_tune <- lm(sqrt_PremiumPrice ~ 1, data = insurance_train_tune)
model_all_tune <- lm(sqrt_PremiumPrice ~ ., data = insurance_train_tune)
model_backward_tune <- step(object = model_all_tune, direction = "backward", trace = F)
model_forward_tune <- step(object = model_none_tune, direction = "forward", scope = list(upper = model_all_tune), trace = F)
model_both_tune <- step(object = model_none_tune, direction = "both", scope = list(upper = model_all_tune), trace = F)
15.3 Prediction using New Models

Make predictions using five new models.

# prediction using training set predictors
predict_model_none_tune_train <- (model_none_tune$fitted.values)^2
predict_model_all_tune_train <- (model_all_tune$fitted.values)^2
predict_model_backward_tune_train <- (model_backward_tune$fitted.values)^2
predict_model_forward_tune_train <- (model_forward_tune$fitted.values)^2
predict_model_both_tune_train <- (model_both_tune$fitted.values)^2
# prediction using testing set predictors
predict_model_none_tune_test <- (predict(object = model_none_tune, newdata = insurance_test))^2
predict_model_all_tune_test <- (predict(object = model_all_tune, newdata = insurance_test))^2
predict_model_backward_tune_test <- (predict(object = model_backward_tune, newdata = insurance_test))^2
predict_model_forward_tune_test <- (predict(object = model_forward_tune, newdata = insurance_test))^2
predict_model_both_tune_test <- (predict(object = model_both_tune, newdata = insurance_test))^2
15.4 New Models’ Performance

Calculate the RMSE of Models’ Predictions.

# RMSE on training set
rmse_none_tune_train <- RMSE(y_pred = predict_model_none_tune_train, y_true = insurance_train$PremiumPrice)
rmse_all_tune_train <- RMSE(y_pred = predict_model_all_tune_train, y_true = insurance_train$PremiumPrice)
rmse_backward_tune_train <- RMSE(y_pred = predict_model_backward_tune_train, y_true = insurance_train$PremiumPrice)
rmse_forward_tune_train <- RMSE(y_pred = predict_model_forward_tune_train, y_true = insurance_train$PremiumPrice)
rmse_both_tune_train <- RMSE(y_pred = predict_model_both_tune_train, y_true = insurance_train$PremiumPrice)
# RMSE on testing set
rmse_none_tune_test <- RMSE(y_pred = predict_model_none_tune_test, y_true = insurance_test$PremiumPrice)
rmse_all_tune_test <- RMSE(y_pred = predict_model_all_tune_test, y_true = insurance_test$PremiumPrice)
rmse_backward_tune_test <- RMSE(y_pred = predict_model_backward_tune_test, y_true = insurance_test$PremiumPrice)
rmse_forward_tune_test <- RMSE(y_pred = predict_model_forward_tune_test, y_true = insurance_test$PremiumPrice)
rmse_both_tune_test <- RMSE(y_pred = predict_model_both_tune_test, y_true = insurance_test$PremiumPrice)
# dataframe containing rmse
model <- c("model_none_tune", "model_all_tune", "model_backward_tune", "model_forward_tune", "model_both_tune")
rmse_train <- c(rmse_none_tune_train, rmse_all_tune_train, rmse_backward_tune_train, rmse_forward_tune_train, rmse_both_tune_train)
rmse_test <- c(rmse_none_tune_test, rmse_all_tune_test, rmse_backward_tune_test, rmse_forward_tune_test, rmse_both_tune_test)
data.frame(model, rmse_train, rmse_test)

Based on the table above, model_backward_tune can be chosen as the best of the tuned models because it has the smallest RMSE value (rmse_test). However, it seems that the target transformation does not reduce the RMSE. Still, let's see if the tuned model can satisfy all the linear regression assumptions.

15.5 New Models’ Assumptions
# check normality
check_normality(model_backward_tune)
## Warning: Non-normality of residuals detected (p < .001).
# check heteroscedasticity
check_heteroscedasticity(model_backward_tune)
## OK: Error variance appears to be homoscedastic (p = 0.296).
# check multicollinearity
data.frame(check_collinearity(model_backward_tune))

Conclusion on assumption test:

  • Errors are still not normally distributed (assumption not fulfilled)
  • Errors are distributed with equal variance (homoscedasticity is present)
  • There is no multicollinearity in the model
15.6 Conclusion on Tuned Model

It seems that tuning still can't improve the performance of our model (the RMSE actually gets bigger, and the normality assumption is still not fulfilled). I've also tried a log transformation on the target column, but the result is the same. Therefore, because the untuned model gives a smaller RMSE value, it is better to use the untuned model (model_backward) for interpretation.

16. Conclusion and Interpretation

  1. Best model: model_backward with summary as follows.
summary(model_backward)
## 
## Call:
## lm(formula = PremiumPrice ~ Age + Diabetes + AnyTransplants + 
##     AnyChronicDiseases + HistoryOfCancerInFamily + NumberOfMajorSurgeries + 
##     BMI, data = insurance_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13551.7  -2173.0   -361.1   1835.8  21568.3 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               5640.40     766.24   7.361 4.64e-13 ***
## Age                        338.96      10.50  32.286  < 2e-16 ***
## Diabetes1                 -577.72     272.65  -2.119   0.0344 *  
## AnyTransplants1           8305.87     566.58  14.660  < 2e-16 ***
## AnyChronicDiseases1       2682.18     352.27   7.614 7.69e-14 ***
## HistoryOfCancerInFamily1  2567.94     412.30   6.228 7.70e-10 ***
## NumberOfMajorSurgeries    -816.13     201.58  -4.049 5.66e-05 ***
## BMI                        151.72      22.45   6.757 2.75e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3675 on 780 degrees of freedom
## Multiple R-squared:  0.6584, Adjusted R-squared:  0.6553 
## F-statistic: 214.8 on 7 and 780 DF,  p-value: < 2.2e-16

and its formula, where each dummy variable (e.g. Diabetes1) equals 1 if the condition is "Yes" and 0 otherwise:

\[
\begin{aligned}
PremiumPrice = {} & 5640.40 + 338.96 \cdot Age - 577.72 \cdot Diabetes1 \\
& + 8305.87 \cdot AnyTransplants1 + 2682.18 \cdot AnyChronicDiseases1 \\
& + 2567.94 \cdot HistoryOfCancerInFamily1 \\
& - 816.13 \cdot NumberOfMajorSurgeries + 151.72 \cdot BMI
\end{aligned}
\]
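
As mentioned in the preface, the final model can be used to estimate a premium by plugging in a client's conditions; a minimal sketch for a hypothetical profile (the values below are made up for illustration):

# estimate the yearly premium for a hypothetical client
new_client <- data.frame(
  Age = 35,
  Diabetes = factor(0, levels = c(0, 1)),
  AnyTransplants = factor(0, levels = c(0, 1)),
  AnyChronicDiseases = factor(0, levels = c(0, 1)),
  HistoryOfCancerInFamily = factor(0, levels = c(0, 1)),
  NumberOfMajorSurgeries = 1,
  BMI = 24.2
)
predict(model_backward, newdata = new_client)

Plugging the same values into the formula by hand gives 5640.40 + 338.96 × 35 - 816.13 × 1 + 151.72 × 24.2 ≈ 20359 Indian rupees.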

  2. Model Interpretation:
  • The older a person is, the higher the premium price (Age has a positive coefficient), assuming the values of the other predictors are constant.
  • The greater the BMI, the higher the premium price (BMI has a positive coefficient), assuming the values of the other predictors are constant.
  • The greater the number of major surgeries a person has had, the lower the premium price (NumberOfMajorSurgeries has a negative coefficient), assuming the values of the other predictors are constant.
  • A person with diabetes pays about 577.72 Indian rupees less than a person without diabetes, assuming the values of the other predictors are constant.
  • A person who has had an organ transplant pays about 8305.87 Indian rupees more than a person who has not, assuming the values of the other predictors are constant.
  • A person with a chronic disease pays about 2682.18 Indian rupees more than a person without one, assuming the values of the other predictors are constant.
  • A person with a blood relative who has had cancer pays about 2567.94 Indian rupees more than a person without such a family history, assuming the values of the other predictors are constant.
  • AnyTransplants, AnyChronicDiseases, and HistoryOfCancerInFamily appear to contribute the most to the premium price, as they have the largest coefficients.
  • The predictors used in the model explain 65.53% of the variation in the premium price (Adjusted R-squared); the remaining variation is explained by factors not used in the modeling.
  3. Concerns:
  • The Adjusted R-squared of model_backward is still relatively modest. Hence, there may be other variables out there that significantly affect medical insurance premiums. In addition, the RMSE value is still relatively large, so it is highly recommended to add other variables that can improve prediction accuracy.
  • The Diabetes and NumberOfMajorSurgeries variables have negative coefficients, which does not match our intuition. This can be studied further in the future.
  • model_backward returns errors that are not normally distributed, so it does not meet that linear regression assumption. This assumption is somewhat optional, as the Ordinary Least Squares (OLS) method does not require normally distributed errors to produce unbiased estimates with minimum variance (among linear unbiased estimators). However, when the assumption is violated, the results of statistical hypothesis tests (parameter significance tests) and the calculation of confidence intervals can become misleading. This can be addressed with other methods that may also improve prediction accuracy, e.g. nonparametric regression.