With the constant increasing prices of healthcare in our country, and with the ever rising instances of diseases, health insurance today is a necessity. Health insurance provides people with a much needed financial backup at times of medical emergencies. Health risks and uncertainties are a part of life. One cannot plan and get sick but one can certainly be prepared for the financial aspect. One of the ways to be financially prepared against uncertain health risks is by buying health insurance.
Health insurance is a type of insurance coverage that pays for medical expenses incurred by the insured. Health insurance can reimburse the insured for expenses incurred from illness or injury, or pay the care provider directly. Purchasing a health insurance is an integral part of financial planning as health insurance provides people with a much needed financial backup at times of medical emergencies. However, the health insurance premium that one has to pay to avail varies for different people, and is based on a number of different factors.
In this report, we are going to analyse and predict what are the factors that affect health insurance premium. With that purpose, Linear Regression will be performed. The process includes Data Preparation, Exploratory Data Analysis, Building Model, Prediction & Model Performance, Model Assumption, Alternative Model and Conclusion.
# Library Input
library(dplyr)
library(GGally)
library(ggcorrplot)
library(MLmetrics)
library(performance)
library(ggplot2)
library(lmtest)
library(car)
library(RColorBrewer)
library(DT)# Data Input
insurance <- read.csv("data/insurance.csv")
head(insurance)## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
The data used in this report contains a list of insurance contract of 1.338 individuals. The insurance data used consists of several variables with the following details:
age: age of primary beneficiarysex: insurance contractor gender, female, malebmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9children: Number of children covered by health insurance / Number of dependentssmoker: Smoking or notregion: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwestcharges: Individual medical costs billed by health insurance# Data Coercion
insurance[ , c("sex", "smoker", "region")] <-
lapply(insurance[ , c("sex", "smoker", "region")], as.factor)
str(insurance)## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
# Checking Missing Values
colSums(is.na(insurance))## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
All data types have been converted to the desired data types and there’s no more missing value.
Exploratory data analysis is a phase where we explore the data variables, and find out any pattern that can indicate any kind of correlation between the variables.
# Correlation
model.matrix(~0+., data=insurance) %>%
cor(use="pairwise.complete.obs") %>%
ggcorrplot(show.diag = F, type="lower", lab=TRUE, lab_size=2,
colors = c("#1B9E77", "white", "#D95F02")) As can be seen from the correlation plot, smoker, age and bmi are positively related to charges. So, we are going to analyse their relationship further.
# Scatter Plot 1
ggplot(insurance, aes(x=age, y=charges, shape=smoker, color=smoker)) +
geom_point() +
theme_minimal() +
scale_color_brewer(palette = "Dark2")As can be seen from the plot above, the older the age, the more expensive the charges. Also, it can be seen that smoker paid higher charges than non-smoker.
# Scatter Plot 2
ggplot(insurance, aes(x=bmi, y=charges, shape=smoker, color=smoker)) +
geom_point() +
theme_minimal() +
scale_color_brewer(palette = "Dark2")As can be seen from the plot above, the higher the bmi, the charge gets relatively more expensive. The same with the previous plot, it can be seen that smoker paid higher charges than non-smoker.
Before making the model, the Train-Test Split is necessary. In this process, we will split the data into train data set and test data set. The train data set will be used to build a linear regression model, while the test data set will be used to predict the target variable. We will took 80% of the data as the train data set and the rest will be used as the test data set.
# Splitting the data set
RNGkind(sample.kind = "Rounding")## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index_ins <- sample(x = nrow(insurance) , size = nrow(insurance)*0.8)
ins_train <- insurance[index_ins , ]
ins_test <- insurance[-index_ins, ]# Building Model (all variables)
model_all <- lm(formula = charges ~ . , data = ins_train)
summary(model_all)##
## Call:
## lm(formula = charges ~ ., data = ins_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10565.5 -2842.7 -972.6 1366.4 30412.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12156.36 1103.66 -11.015 < 2e-16 ***
## age 254.38 13.19 19.284 < 2e-16 ***
## sexmale -65.79 368.74 -0.178 0.85842
## bmi 345.93 31.77 10.890 < 2e-16 ***
## children 463.06 151.97 3.047 0.00237 **
## smokeryes 23454.01 457.47 51.269 < 2e-16 ***
## regionnorthwest -156.13 533.18 -0.293 0.76971
## regionsoutheast -1001.14 533.88 -1.875 0.06104 .
## regionsouthwest -1188.41 530.61 -2.240 0.02532 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5998 on 1061 degrees of freedom
## Multiple R-squared: 0.7476, Adjusted R-squared: 0.7457
## F-statistic: 392.8 on 8 and 1061 DF, p-value: < 2.2e-16
Form the first model, it can be seen that the most significant variables are age, bmi, and smoker. Which is in accordance to our previous statement. As can be seen from the model summary, the adjusted R-squared is 0.7457. That means based on the predictor variables included in the model, 74.57% of them are able to explain the charges variable, the rest was explained by other variables that were not included in the model.
Next, we are going to do ‘Feature Selection’. Feature selection is the stage in selecting the variables to be used in the model. This process will evaluate the stepwise model using the AIC (Akaike Information Criterion/ Information Loss) value. AIC shows a lot of missing information on the model.
# Feature Selection (backward)
model_backward <- step(object = model_all, direction = "backward", trace = 0)
summary(model_backward)##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = ins_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10597.7 -2822.3 -981.7 1375.2 30384.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12178.81 1095.96 -11.112 < 2e-16 ***
## age 254.39 13.19 19.294 < 2e-16 ***
## bmi 345.64 31.71 10.900 < 2e-16 ***
## children 462.07 151.80 3.044 0.00239 **
## smokeryes 23448.18 456.09 51.411 < 2e-16 ***
## regionnorthwest -154.11 532.82 -0.289 0.77246
## regionsoutheast -1001.87 533.62 -1.877 0.06073 .
## regionsouthwest -1188.67 530.37 -2.241 0.02522 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5996 on 1062 degrees of freedom
## Multiple R-squared: 0.7476, Adjusted R-squared: 0.7459
## F-statistic: 449.3 on 7 and 1062 DF, p-value: < 2.2e-16
Now, we will compare the two models we have produced. Too see if this action affect our model, we can check the Adjusted R-Squared value from our two previous models. The first model with complete variables has adjusted R-squared of 0.7457, meaning that the model can explain 74.57% of variance in the target variable (insurance charges). Meanwhile, our simpler model has adjusted R-squared of 74.59%. Which is not a big difference compared to the first model. This shows that it is safe to remove variables that has no significant coefficient values.
Thus, we are going to use the second model, which gives the formula: \[ \hat{y} = -12178.81 + 254.39*x_{age} + 345.64*x_{bmi} + 462.07*x_{children} \\ + 23448.18*x_{smokeryes} - 154.11*x_{regionnorthwest} \\ - 1001.87*x_{regionsoutheast} - 1188.67*x_{regionsouthwest} \]
Notes:
x_{smokeryes} will be “1” if the beneficiary smokes and “0” if they didn’tx_{regionnorthwest} will be “1” if beneficiary’s residential area is in northwest and “0” if it isn’tx_{regionsoutheast} will be “1” if beneficiary’s residential area is in southeast and “0” if it isn’tx_{regionsouthwest} will be “1” if beneficiary’s residential area is in southwest and “0” if it isn’tThe prediction will originally run on the test data set. But we will predict based on the train data set as well to check whether or not our model is over-fitted.
# Prediction of train data set
pred <- predict(model_backward, newdata = ins_train)
ins_train$prediction <- pred# Error on the train data set prediction
MAPE(y_pred = ins_train$prediction, y_true = ins_train$charges)*100## [1] 42.99068
# Prediction of test data set
prediction <- predict(model_backward, newdata = ins_test)
ins_test$prediction <- prediction# Error on the train data set prediction
MAPE(y_pred = ins_test$prediction, y_true = ins_test$charges)*100## [1] 38.74866
Based on the prediction’s error, we can conclude that the model is not overfit. As can be seen from the MAPE of test data set prediction, the error percentage is around 38.7% ; which is quite high.
Here are the prediction result of the test data set, with prediction is the prediction of the insurance charges, and charges is the actual insurance charges.
datatable(ins_test)Linear regression is an analysis that assesses whether one or more predictor variables explain the dependent (criterion) variable. The regression has five key assumptions: Linear relationship, Multivariate normality, No or little multicollinearity, No auto-correlation and Homoscedasticity. Next we are going to see whether or not our model fulfilled these certain assumptions.
First, linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots.
# Linearity Scatter Plot
resact <- data.frame(residual = model_backward$residuals, fitted = model_backward$fitted.values)
ggplot(resact, aes(fitted, residual)) +
geom_point(aes(col = "#D95F02")) +
geom_smooth(aes(col = "#1B9E77")) +
geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank()) ## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Based on our pattern, it can be seen that there’s little to no linearity of our model.
Secondly, the linear regression analysis requires all variables to be multivariate normal. This assumption can best be checked with histogram or the Shapiro test.
hist(model_backward$residuals, col = "#1B9E77")# Shapiro Test
shapiro.test(model_backward$residuals)##
## Shapiro-Wilk normality test
##
## data: model_backward$residuals
## W = 0.89751, p-value < 2.2e-16
Based on the test result, our model does NOT fulfill the normality assumption. Thus, further data transformation will be needed.
Thirdly, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.The presence or absence of multicollinearity can be seen from the value of VIF (Variance Inflation Factor): When the VIF value is more than 10, it means that there is multicollinearity.
vif(model_backward)## GVIF Df GVIF^(1/(2*Df))
## age 1.011663 1 1.005815
## bmi 1.108823 1 1.053007
## children 1.004031 1 1.002014
## smoker 1.007939 1 1.003962
## region 1.109810 3 1.017516
Based on the values of VIF, since all VIF < 10, that means there’s no multicolinearity.
Fourthly, linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other. In other words when the value of y(x+1) is not independent from the value of y(x). We can test the linear regression model for autocorrelation with the Durbin-Watson test.
durbinWatsonTest(model_backward)## lag Autocorrelation D-W Statistic p-value
## 1 0.04034081 1.915485 0.162
## Alternative hypothesis: rho != 0
The result shows that the null hypothesis is not rejected, meaning that our residual has no autocorrelation in it.
The last assumption of the linear regression analysis is homoscedasticity. The scatter plot is good way to check whether the data are homoscedastic (meaning the residuals are equal across the regression line). We can also check it using the Breusch-Pagan test.
plot(x = model_backward$fitted.values, y = model_backward$residuals, col = "#1B9E77")
abline(h = 0, col = "#D95F02", lty = 2)bptest(model_backward)##
## studentized Breusch-Pagan test
##
## data: model_backward
## BP = 102.58, df = 7, p-value < 2.2e-16
Based on the test, Because p-value < alpha, it means that the error variance does NOT spreads randomly or is heteroscedasticity. Thus, further data transformation is needed.
Based on the previous evaluation, we can conclude that the model doesn’t fulfill some of the assumptions, such as Linearity, Normality of Residual, and Homoscedasticity of Residual. Which indicates that further data transformation might be needed. However, after doing data transformation (scaling) and re-building the model, unfortunately, it didn’t improve the model and the model still didn’t fulfill those three assumptions. Which means this data set might not be large enough and there aren’t enough predictors to explain the target variables.
It is important to notice that, these assumption only need to be fulfilled if the purpose of making the linear regression model is to interpret or to see the effect of each predictor on the value of the target variable. If the purpose of the linear regression model is to make predictions, then the model assumptions are not required to be met.
As mentioned in the previous part, the model doesn’t fulfill some of the assumptions, such as Linearity, Normality of Residual, and Homoscedasticity of Residual. Although it’s still fine to use the model to do prediction, we are going to offer an alternative model to do so using Random Forest Regression.
# Library Input
library(randomForest)
library(caret)When creating a Random Forest model, we are not required to split the training-testing dataset up front. This is because the model can already estimate the performance of the model in unseen data.
# Building Model
model_rf <- randomForest(formula = charges ~ . , data = insurance, ntree = 1000, mtry = 3,
keep.forest = FALSE, importance = TRUE)
model_rf##
## Call:
## randomForest(formula = charges ~ ., data = insurance, ntree = 1000, mtry = 3, keep.forest = FALSE, importance = TRUE)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 21341708
## % Var explained: 85.44
As can be seen above, % Var explained is 85.44. Which means that the model can explain 85.44% of variance in the target variable (insurance charges). Meanwhile, our previous linear regression model has only 74.59%.
# Prediction
pred_rf <- model_rf$predicted
insurance_rf <- insurance
insurance_rf$prediction <- pred_rfHere are the prediction result of the data set, with prediction is the prediction of the insurance charges, and charges is the actual insurance charges.
datatable(insurance_rf)Even though the random forest is labeled as a non-interpretable model, at least we can see what predictors are most used (the most important) in the random forest model:
importance <- as.data.frame(varImp(model_rf))
importance %>%
arrange(-Overall)## Overall
## smoker 323.343756
## age 157.152121
## bmi 156.224497
## children 29.771443
## region 13.349770
## sex -4.239235
By the result, we can see that smoker is the most important variable followed by age and then bmi.
The Linear Regression Model generated has Adjusted R-Squared value of 0.7459, meaning that the model can explain 74.59% of variance in the target variable (insurance charges). Which is quite good, but not high. The model unfortunately also does not fulfill some classic assumption such as Linearity, Normality of Residual, and Homoscedasticity of Residual. Thus, adding more data into the data set might be needed (the data set might not be large enough and there aren’t enough predictors to explain the target variables).
The alternative model, Random Forest Regression Model can explain 85.44% of variance in the target variable (insurance charges) compared to the previous linear regression model has only 74.59%.
However, both models agreed that that the most significant variables in predicting insurance charges are smoker, age and bmi. So, the insurance charges will be more expensive if the beneficiary is an active smoker and/ or is older, and/ or has a heavier BMI.