It is natural that we all will do whatever it takes to protect ourselves, our family, and our loved ones, especially in matters related to health. However, it is not uncommon that we also want to do it at a cost we can afford - fits within our financial budget.
A recent study from industry groups LIMRA and Life Happens found out that 102 million uninsured and underinsured Americans (which are 40% of the country’s adult population) believe that they need to purchase medical insurance or invest in more coverage. However, more than one in three put off buying coverage because they often think it costs more than it does. That matter can put their family at risk if they get in accident unexpectedly.
Many people believe medical insurance is out of reach financially, often because they don’t know how medical insurance factors impact pricing. Yet medical insurance is one of the most affordable parts of a strong financial plan. Since cost misunderstandings are a main reason why people don’t buy, let’s consider what really goes into the premium price of medical insurance.
In this project, we will focus on linear regression to investigate how the predictors influence the medical insurance premium price. In the future, you can also use the final equation/model to estimate the price of your insurance premium by putting in several conditions. So, without further ado, check this out~ 😉
⚠️ Disclaimer:
Predictions generated by the final equation/model are estimates only. In fact, there are another conditions that can affect the accuracy of predictions (such as year, exchange rate, inflation, insurance company rules, etc.).
My opinion: just focus on how each variable affects the premium price (whether the predictor variable give a positive effect/increasing premium price or give negative effect/lowering premium price)
This project uses the dataset taken from kaggle. This dataset contains information from 986 clients of an insurance company, including information regarding yearly premium prices.
The columns in the dataset consist of:
Age : the age of the client (in years)Diabetes : whether the client has abnormal bloodsugar
levels (0 = No, 1 = Yes)BloodPressureProblems : whether the client has abnormal
blood pressure levels (0 = No, 1 = Yes)AnyTransplants : whether the client has any major organ
transplants (0 = No, 1 = Yes)AnyChronicDiseases : whether the client suffers from
chronic ailments like asthma, etc. (0 = No, 1 = Yes)Height : the height of the client (in centimeter)Weight : the weight of the client (in kilogram)KnownAllergies : whether the client has any known
allergies (0 = No, 1 = Yes)HistoryOfCancerInFamily : whether any blood relative of
the client has had any form of cancer (0 = No, 1 = Yes)NumberOfMajorSurgeries : the number of major surgeries
that the client has hadPremiumPrice : yearly medical insurance premium price
(in Indian Rupees/INR(₹) Currency)Load all the libraries needed to work on this project:
# load libraries
library(dplyr) # for data transformation
library(DT) # to make a table in rmarkdown
library(ggplot2) # to make visualization
library(GGally) # for EDA or correlation check
library(performance) # for model comparison and assumption check
library(MLmetrics) # to calculate the error value# read data
insurance <- read.csv("data_input/insurance.csv", header=T, na.strings=c(""))Based on the table above, we can see that our dataset consists of 11 variables.
First of all, we have to check the structure of our dataset - whether the data types are correct or not.
# check data structure
str(insurance)## 'data.frame': 986 obs. of 11 variables:
## $ Age : int 45 60 36 52 38 30 33 23 48 38 ...
## $ Diabetes : int 0 1 1 1 0 0 0 0 1 0 ...
## $ BloodPressureProblems : int 0 0 1 1 0 0 0 0 0 0 ...
## $ AnyTransplants : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AnyChronicDiseases : int 0 0 0 1 1 0 0 0 0 0 ...
## $ Height : int 155 180 158 183 166 160 150 181 169 182 ...
## $ Weight : int 57 73 59 93 88 69 54 79 74 93 ...
## $ KnownAllergies : int 0 0 0 0 0 1 0 1 1 0 ...
## $ HistoryOfCancerInFamily: int 0 0 0 0 0 0 0 0 0 0 ...
## $ NumberOfMajorSurgeries : int 0 0 1 2 1 1 0 0 0 0 ...
## $ PremiumPrice : int 25000 29000 23000 28000 23000 23000 21000 15000 23000 23000 ...
There seems to be some mismatched data types. So, before we proceed to the next step, we must change these data types.
# change data type
insurance[,c("Diabetes","BloodPressureProblems","AnyTransplants","AnyChronicDiseases",
"KnownAllergies","HistoryOfCancerInFamily")] <-
lapply(insurance[,c("Diabetes","BloodPressureProblems",
"AnyTransplants","AnyChronicDiseases","KnownAllergies",
"HistoryOfCancerInFamily")], as.factor)We can also add a new column containing BMI (Body Mass Index) - a medical screening tool that measures the ratio of height to weight to estimate the amount of body fat.
# make BMI column
insurance <- insurance %>%
mutate(BMI = Weight/((Height/100)^2))Check the data structure again for one last time before we move on to the next step.
# take a glimpse of the data
glimpse(insurance)## Rows: 986
## Columns: 12
## $ Age <int> 45, 60, 36, 52, 38, 30, 33, 23, 48, 38, 60, 66~
## $ Diabetes <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0~
## $ BloodPressureProblems <fct> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0~
## $ AnyTransplants <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0~
## $ AnyChronicDiseases <fct> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Height <int> 155, 180, 158, 183, 166, 160, 150, 181, 169, 1~
## $ Weight <int> 57, 73, 59, 93, 88, 69, 54, 79, 74, 93, 74, 67~
## $ KnownAllergies <fct> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1~
## $ HistoryOfCancerInFamily <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ NumberOfMajorSurgeries <int> 0, 0, 1, 2, 1, 1, 0, 0, 0, 0, 2, 0, 1, 0, 1, 1~
## $ PremiumPrice <int> 25000, 29000, 23000, 28000, 23000, 23000, 2100~
## $ BMI <dbl> 23.72529, 22.53086, 23.63403, 27.77031, 31.934~
Also check if there are any missing values.
# check missing values
colSums(is.na(insurance))## Age Diabetes BloodPressureProblems
## 0 0 0
## AnyTransplants AnyChronicDiseases Height
## 0 0 0
## Weight KnownAllergies HistoryOfCancerInFamily
## 0 0 0
## NumberOfMajorSurgeries PremiumPrice BMI
## 0 0 0
It looks like that the data types are already correct. Furthermore, there are no missing values in our dataset, hence, we can proceed to the next phase.
PremiumPriceAge, Diabetes,
BloodPressureProblems, AnyTransplants,
AnyChronicDiseases, Height,
Weight, KnownAllergies,
HistoryOfCancerInFamily,
NumberOfMajorSurgeries, BMIWe know that the BMI is the result of a calculation
between Height and Weight. So, by intuition,
we can remove Height and Weight from the
predictor variables as their values are already represented by the
BMI column.
# remove Height and Weight columns
insurance_clean <- insurance %>%
select(-c("Height","Weight"))Check whether our dataset has outliers.
# check outliers
ggplot(data = insurance_clean, mapping = aes(x = "", y = PremiumPrice)) +
geom_boxplot(fill = "#C2981D", color = "#415C23") +
xlab("") +
ylab("Medical Insurance Premium Price")It can be seen that our dataset has some outliers (although it’s not too much). Let’s see the value of these outliers.
# find out the value of the outliers
boxplot.stats(insurance_clean$PremiumPrice)$out## [1] 39000 40000 39000 39000 39000 39000
So, we can conclude that the premium prices become outliers if the values are more than 39000.
But for now, let’s just ignore these outliers, and we’ll talk about them later when tuning the model.
After finding out about outliers, we continue by checking the correlation between variables.
# check correlation using heatmap
ggcorr(insurance_clean, hjust = 1, layout.exp = 1, label = TRUE)# check correlation using correlation test
# a starred value means that there is a significant correlation
ggduo(insurance_clean,
c("Age","NumberOfMajorSurgeries","BMI"), "PremiumPrice",
types = list(continuous = "cor"))# check correlation between categoric variable and target variable
ggduo(insurance_clean,
c("Diabetes","BloodPressureProblems","AnyTransplants"),
"PremiumPrice")# check correlation between categoric variable and target variable
ggduo(insurance_clean,
c("AnyChronicDiseases","KnownAllergies","HistoryOfCancerInFamily"),
"PremiumPrice")
From the correlation check that has been done, we must pay
attention to a few things, such as:
Age,
NumberOfMajorSurgeries, BMI with
PremiumPrice were statistically significantAge and
NumberOfMajorSurgeries, but we can ignore it for now as we
can check it later during multicollinearity-checkingDiabetes, KnownAllergies and
BloodPressureProblems- have no correlation and no effect to
the target), but we can ignore them for now as we can check them later
in the modelsDivide our dataset into train set and test set respectively with a ratio of 80:20.
# train-test splitting
RNGkind(sample.kind = "Rounding")
set.seed(100)
insample <- sample(nrow(insurance_clean), nrow(insurance_clean)*0.8)
insurance_train <- insurance_clean[insample,]
insurance_test <- insurance_clean[-insample,]In this section, we will use the lm() function to create five models for comparison:
Create a model without using any predictor. In other words, the coefficient is only an intercept.
# model without any predictor
model_none <- lm(PremiumPrice ~ 1, data = insurance_train)
summary(model_none)##
## Call:
## lm(formula = PremiumPrice ~ 1, data = insurance_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9402.3 -1652.3 -902.3 4597.7 15597.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24402 223 109.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6260 on 787 degrees of freedom
Create a model with all predictors.
# model with all predictors
model_all <- lm(PremiumPrice ~ ., data = insurance_train)
summary(model_all)##
## Call:
## lm(formula = PremiumPrice ~ ., data = insurance_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13713.9 -2151.4 -296.5 1806.8 21416.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5510.43 768.73 7.168 1.77e-12 ***
## Age 337.90 10.59 31.916 < 2e-16 ***
## Diabetes1 -568.73 274.92 -2.069 0.0389 *
## BloodPressureProblems1 345.69 274.65 1.259 0.2085
## AnyTransplants1 8329.90 566.53 14.703 < 2e-16 ***
## AnyChronicDiseases1 2689.03 352.25 7.634 6.68e-14 ***
## KnownAllergies1 442.26 320.36 1.381 0.1678
## HistoryOfCancerInFamily1 2514.17 413.16 6.085 1.82e-09 ***
## NumberOfMajorSurgeries -891.68 205.57 -4.338 1.63e-05 ***
## BMI 150.36 22.47 6.691 4.24e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3672 on 778 degrees of freedom
## Multiple R-squared: 0.6599, Adjusted R-squared: 0.656
## F-statistic: 167.7 on 9 and 778 DF, p-value: < 2.2e-16
Create a model with backward stepwise regression process
# model with backward stepwise regression process
model_backward <- step(object = model_all, direction = "backward", trace = F)
summary(model_backward)##
## Call:
## lm(formula = PremiumPrice ~ Age + Diabetes + AnyTransplants +
## AnyChronicDiseases + HistoryOfCancerInFamily + NumberOfMajorSurgeries +
## BMI, data = insurance_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13551.7 -2173.0 -361.1 1835.8 21568.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5640.40 766.24 7.361 4.64e-13 ***
## Age 338.96 10.50 32.286 < 2e-16 ***
## Diabetes1 -577.72 272.65 -2.119 0.0344 *
## AnyTransplants1 8305.87 566.58 14.660 < 2e-16 ***
## AnyChronicDiseases1 2682.18 352.27 7.614 7.69e-14 ***
## HistoryOfCancerInFamily1 2567.94 412.30 6.228 7.70e-10 ***
## NumberOfMajorSurgeries -816.13 201.58 -4.049 5.66e-05 ***
## BMI 151.72 22.45 6.757 2.75e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3675 on 780 degrees of freedom
## Multiple R-squared: 0.6584, Adjusted R-squared: 0.6553
## F-statistic: 214.8 on 7 and 780 DF, p-value: < 2.2e-16
Create a model with forward stepwise regression process
# model with forward stepwise regression process
model_forward <- step(object = model_none,
direction = "forward",
scope = list(upper = model_all),
trace = F)
summary(model_forward)##
## Call:
## lm(formula = PremiumPrice ~ Age + AnyTransplants + AnyChronicDiseases +
## BMI + HistoryOfCancerInFamily + NumberOfMajorSurgeries +
## Diabetes, data = insurance_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13551.7 -2173.0 -361.1 1835.8 21568.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5640.40 766.24 7.361 4.64e-13 ***
## Age 338.96 10.50 32.286 < 2e-16 ***
## AnyTransplants1 8305.87 566.58 14.660 < 2e-16 ***
## AnyChronicDiseases1 2682.18 352.27 7.614 7.69e-14 ***
## BMI 151.72 22.45 6.757 2.75e-11 ***
## HistoryOfCancerInFamily1 2567.94 412.30 6.228 7.70e-10 ***
## NumberOfMajorSurgeries -816.13 201.58 -4.049 5.66e-05 ***
## Diabetes1 -577.72 272.65 -2.119 0.0344 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3675 on 780 degrees of freedom
## Multiple R-squared: 0.6584, Adjusted R-squared: 0.6553
## F-statistic: 214.8 on 7 and 780 DF, p-value: < 2.2e-16
Create a model with forward stepwise regression process
# model with forward-backward stepwise regression process
model_both <- step(object = model_none,
direction = "both",
scope = list(upper = model_all),
trace = F)
summary(model_both)##
## Call:
## lm(formula = PremiumPrice ~ Age + AnyTransplants + AnyChronicDiseases +
## BMI + HistoryOfCancerInFamily + NumberOfMajorSurgeries +
## Diabetes, data = insurance_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13551.7 -2173.0 -361.1 1835.8 21568.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5640.40 766.24 7.361 4.64e-13 ***
## Age 338.96 10.50 32.286 < 2e-16 ***
## AnyTransplants1 8305.87 566.58 14.660 < 2e-16 ***
## AnyChronicDiseases1 2682.18 352.27 7.614 7.69e-14 ***
## BMI 151.72 22.45 6.757 2.75e-11 ***
## HistoryOfCancerInFamily1 2567.94 412.30 6.228 7.70e-10 ***
## NumberOfMajorSurgeries -816.13 201.58 -4.049 5.66e-05 ***
## Diabetes1 -577.72 272.65 -2.119 0.0344 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3675 on 780 degrees of freedom
## Multiple R-squared: 0.6584, Adjusted R-squared: 0.6553
## F-statistic: 214.8 on 7 and 780 DF, p-value: < 2.2e-16
Compare the performance of the five models that have been generated previously. Look at their Multiple/Adjusted R-squared and RMSE values.
# model performance comparison (based on train set)
model_comparison1 <- compare_performance(model_none,
model_all,
model_backward,
model_forward,
model_both)
as.data.frame(model_comparison1) %>%
select("Name","Model","AIC","R2","R2_adjusted","RMSE")Use all five models to predict PremiumPrice, where the
predictor values are wrapped in the test set.
# predict based on model_none
predict_model_none <- predict(object = model_none, newdata = insurance_test)# predict based on model_all
predict_model_all <- predict(object = model_all, newdata = insurance_test)# predict based on model_backward
predict_model_backward <- predict(object = model_backward, newdata = insurance_test)# predict based on model_forward
predict_model_forward <- predict(object = model_forward, newdata = insurance_test)# predict based on model_both
predict_model_both <- predict(object = model_both, newdata = insurance_test)Let’s check the RMSE values of models on testing data.
# model performance comparison (based on test set)
Name <- c("model_none", "model_all", "model_backward", "model_forward", "model_both")
rmse_model_none <- RMSE(y_pred = predict_model_none, y_true = insurance_test$PremiumPrice)
rmse_model_all <- RMSE(y_pred = predict_model_all, y_true = insurance_test$PremiumPrice)
rmse_model_backward <- RMSE(y_pred = predict_model_backward, y_true = insurance_test$PremiumPrice)
rmse_model_forward <- RMSE(y_pred = predict_model_forward, y_true = insurance_test$PremiumPrice)
rmse_model_both <- RMSE(y_pred = predict_model_both, y_true = insurance_test$PremiumPrice)
RMSE <- c(rmse_model_none, rmse_model_all, rmse_model_backward, rmse_model_forward, rmse_model_both)
data.frame(Name, RMSE)# data distribution (as a reference only)
# data range of target variable
range(insurance_clean$PremiumPrice)## [1] 15000 40000
# data distribution (as a reference only)
# standard deviation of target variable
sd(insurance_clean$PremiumPrice)## [1] 6248.184
Based on the results above, the following insights can be obtained:
💭 Best Model: model_backward
❓ Why:
The reasons are:
model_backward, model_forward, and
model_both have the same values in all metrics. They have
the smallest AIC, although their Adjusted R-squared values are slightly
smaller and their RMSEs are slightly larger than the rest. They also
have the same predictors, intercept, and coefficients.model_backward, model_forward, and
model_both also have the same values in RMSE. They have the
smallest RMSEs if we compare them to the other models.model_backward,
model_forward, model_both have the smallest
RMSE values. Their Adjusted R-Squared values (based on the model
performance on training set) are also slightly smaller than the rest of
the models, so it’s not a problem. Because all values in
model_backward, model_forward,
model_both are the same, we can choose one of these models.
In this case, I choose model_backward.model_backward, all predictors were included in the
model except KnownAllergies and
BloodPressureProblems. This has become our concern at the
beginning - when we do a correlation test between the categorical
predictor variables and the target, the PremiumPrice
distribution does not look too different for each categorical level of
those two variables (it seems that these categorical predictors have no
correlation and no effect to the target).But before we tune the model_backward, let’s check for
the assumptions first.
Using linearity hypothesis test:
Conclusion: We have already tested the correlation between each
numerical predictor variables and the target variable - all the
predictor variables used in model_backward are
significantly correlated with the target variable. Assumption
fulfilled.
Using Shapiro-Wilk hypothesis test:
# check normality
check_normality(model_backward)## Warning: Non-normality of residuals detected (p < .001).
Conclusion: p-value < 0.05, then reject H0. Errors are not normally distributed. Assumption is not fulfilled.
Using Breusch-Pagan hypothesis test:
# check heteroscedasticity
check_heteroscedasticity(model_backward)## OK: Error variance appears to be homoscedastic (p = 0.101).
Conclusion: p-value > 0.05, then there is not enough evidence to reject H0. Errors are distributed with equal variance (homoscedasticity is present). Assumption fulfilled.
Check VIF (Variance Inflation Factor) value for each predictor variable with the following conditions:
# check multicollinearity
data.frame(check_collinearity(model_backward))Conclusion: all VIF values < 10, then there is no multicollinearity in the model. Assumption fulfilled.
Based on the assumption tests that have been done, which assumptions have been met?
Model tuning is the process of improving performance in various ways, such as feature engineering, removing outliers, transforming, etc. The goal is to minimize the error or fulfill certain assumptions.
Target transformation is done in order to make data become more suitable for normal distribution. In addition, most data scientists believe that they get more accurate results when they perform target transformation. This time, we will use the square-root transformation.
# target transformation
insurance_train_tune <- insurance_train %>%
mutate(sqrt_PremiumPrice = sqrt(PremiumPrice)) %>%
select(-PremiumPrice)See below for the evidence that the sqrt function can remove outliers.
par(mar = c(2, 3, 2, 2))
# before transformation
boxplot(insurance_train$PremiumPrice,
main = "Distribution of Premium Price (Before Transformation)")
# after transformation
boxplot(insurance_train_tune$sqrt_PremiumPrice,
main = "Distribution of Premium Price (After Transformation)")Create five new models using the transformed data. The five models have the following details:
# re-modeling
model_none_tune <- lm(sqrt_PremiumPrice ~ 1, data = insurance_train_tune)
model_all_tune <- lm(sqrt_PremiumPrice ~ ., data = insurance_train_tune)
model_backward_tune <- step(object = model_all_tune, direction = "backward", trace = F)
model_forward_tune <- step(object = model_none_tune, direction = "forward", scope = list(upper = model_all_tune), trace = F)
model_both_tune <- step(object = model_none_tune, direction = "both", scope = list(upper = model_all_tune), trace = F)Make predictions using five new models.
# prediction using training set predictors
predict_model_none_tune_train <- (model_none_tune$fitted.values)^2
predict_model_all_tune_train <- (model_all_tune$fitted.values)^2
predict_model_backward_tune_train <- (model_backward_tune$fitted.values)^2
predict_model_forward_tune_train <- (model_forward_tune$fitted.values)^2
predict_model_both_tune_train <- (model_both_tune$fitted.values)^2# prediction using testing set predictors
predict_model_none_tune_test <- (predict(object = model_none_tune, newdata = insurance_test))^2
predict_model_all_tune_test <- (predict(object = model_all_tune, newdata = insurance_test))^2
predict_model_backward_tune_test <- (predict(object = model_backward_tune, newdata = insurance_test))^2
predict_model_forward_tune_test <- (predict(object = model_forward_tune, newdata = insurance_test))^2
predict_model_both_tune_test <- (predict(object = model_both_tune, newdata = insurance_test))^2Calculate the RMSE of Models’ Predictions.
# RMSE on training set
rmse_none_tune_train <- RMSE(y_pred = predict_model_none_tune_train, y_true = insurance_train$PremiumPrice)
rmse_all_tune_train <- RMSE(y_pred = predict_model_all_tune_train, y_true = insurance_train$PremiumPrice)
rmse_backward_tune_train <- RMSE(y_pred = predict_model_backward_tune_train, y_true = insurance_train$PremiumPrice)
rmse_forward_tune_train <- RMSE(y_pred = predict_model_forward_tune_train, y_true = insurance_train$PremiumPrice)
rmse_both_tune_train <- RMSE(y_pred = predict_model_both_tune_train, y_true = insurance_train$PremiumPrice)# RMSE on testing set
rmse_none_tune_test <- RMSE(y_pred = predict_model_none_tune_test, y_true = insurance_test$PremiumPrice)
rmse_all_tune_test <- RMSE(y_pred = predict_model_all_tune_test, y_true = insurance_test$PremiumPrice)
rmse_backward_tune_test <- RMSE(y_pred = predict_model_backward_tune_test, y_true = insurance_test$PremiumPrice)
rmse_forward_tune_test <- RMSE(y_pred = predict_model_forward_tune_test, y_true = insurance_test$PremiumPrice)
rmse_both_tune_test <- RMSE(y_pred = predict_model_both_tune_test, y_true = insurance_test$PremiumPrice)# dataframe containing rmse
model <- c("model_none_tune", "model_all_tune", "model_backward_tune", "model_forward_tune", "model_both_tune")
rmse_train <- c(rmse_none_tune_train, rmse_all_tune_train, rmse_backward_tune_train, rmse_forward_tune_train, rmse_both_tune_train)
rmse_test <- c(rmse_none_tune_test, rmse_all_tune_test, rmse_backward_tune_test, rmse_forward_tune_test, rmse_both_tune_test)
data.frame(model, rmse_train, rmse_test)Based on the table above, model_backward_tune can be
chosen as the best model because it has the smallest RMSE value
(rmse_test). However, it seems that the target transformation cannot
reduce the RMSE. Still, let’s see if the tuning model can satisfy all
the linear regression assumptions.
# check normality
check_normality(model_backward_tune)## Warning: Non-normality of residuals detected (p < .001).
# check heteroscedasticity
check_heteroscedasticity(model_backward_tune)## OK: Error variance appears to be homoscedastic (p = 0.296).
# check multicollinearity
data.frame(check_collinearity(model_backward_tune))Conclusion on assumption test:
It seems that the tuned model still can’t improve the performance of
our model (the RMSE actually gets bigger and there are still assumption
that is not fulfilled). I’ve also tried log transformation on the target
column, but the result is still the same. Therefore, because the untuned
model gives a smaller RMSE value, it is better to use the untuned model
(model_backward) for interpretation.
model_backward with summary as
follows.summary(model_backward)##
## Call:
## lm(formula = PremiumPrice ~ Age + Diabetes + AnyTransplants +
## AnyChronicDiseases + HistoryOfCancerInFamily + NumberOfMajorSurgeries +
## BMI, data = insurance_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13551.7 -2173.0 -361.1 1835.8 21568.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5640.40 766.24 7.361 4.64e-13 ***
## Age 338.96 10.50 32.286 < 2e-16 ***
## Diabetes1 -577.72 272.65 -2.119 0.0344 *
## AnyTransplants1 8305.87 566.58 14.660 < 2e-16 ***
## AnyChronicDiseases1 2682.18 352.27 7.614 7.69e-14 ***
## HistoryOfCancerInFamily1 2567.94 412.30 6.228 7.70e-10 ***
## NumberOfMajorSurgeries -816.13 201.58 -4.049 5.66e-05 ***
## BMI 151.72 22.45 6.757 2.75e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3675 on 780 degrees of freedom
## Multiple R-squared: 0.6584, Adjusted R-squared: 0.6553
## F-statistic: 214.8 on 7 and 780 DF, p-value: < 2.2e-16
and its formula: \[PremiumPrice = 5640.40 + 338.96 * Age - 577.72 * Diabetes1 (Diabetes: Yes) +\] \[8305.87 * AnyTransplants1 (AnyTransplants: Yes) +\] \[2682.18 * AnyChronicDiseases1 (AnyChronicDiseases: Yes) +\] \[ 2567.94 * HistoryOfCancerInFamily1 (HistoryOfCancerInFamily: Yes) -\] \[816.13 * NumberOfMajorSurgeries + 151.72 * BMI\]
AnyTransplants,
AnyChronicDiseases, HistoryOfCancerInFamily
seem to contribute greatly in determining the premium price because they
have large coefficient values.model_backward is
still relatively small. Hence, there is a possibility that some other
variables out there can significantly affect the medical insurance
premiums. In addition, the RMSE value is still relatively large, so it
is very recommended to add other variables that can improve prediction
accuracy.Diabetes and NumberOfMajorSurgeries
variables have negative coefficients, in which this does not match with
our intuition and our logic. This situation can be studied further in
the future.model_backward returns errors that are not normally
distributed - it does not meet the linear regression assumption.
Actually, this assumption is optional as the Ordinary Least Squares
(OLS) method doesn’t require a normally-distributed errors to produce
unbiased estimates with minimum variance. However, the non-fulfillment
of this assumption can make the results of statistical hypothesis
testing (parameter significance test) and the calculation of confidence
intervals become misleading. Therefore, this situation can be overcome
by using other methods that may also increase the accuracy of
predictions, e.g. nonparametric regression.