Introduction
Rising health care costs are a major economic and public health issue worldwide: according to the World Health Organization, health care accounted for 7.9% of Europe’s gross domestic product (GDP) in 2015. In Switzerland, the health care sector contributes substantially to the national GDP; its share rose from 10.7% to 12.1% between 2010 and 2015. Moreover, because health care utilisation costs may serve as a surrogate for an individual’s health status, understanding which factors drive increases in health expenditures may provide insight into risk factors and potential starting points for preventive measures.
Several studies have addressed the prediction of health care costs, approaching the issue as either a regression problem or a classification problem (classifying costs into predefined “buckets”). Previous studies also examined a broad variety of features. The most commonly used features include different sets of demographic features, health care utilisation parameters (e.g. hospitalisation or outpatient visits), drug codes, diagnosis codes, procedure codes, various chronic disease scores and cost features.
Understanding the data
In this study, we aim to predict the health charges of Bangladeshi citizens based on important features such as their age, sex, BMI, and number of children. We approached the problem as a regression task, examining how each feature correlates with charges in order to capture different patterns in the data. Based on these characteristics, we built models in two ways: with lm() (linear regression) and with glm() (a Gaussian generalized linear model). Finally, we ran some tests on the predictions, which showed good results.
We first load the dataset and take a look at the data.
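The loading chunk is hidden in the knitted output; a minimal sketch, assuming the data ships as a CSV file (the name insurance.csv is hypothetical):
library(dplyr)
library(readr)
Data <- read_csv("insurance.csv")   # hypothetical file name; the source file isn't named in the report
slice_sample(Data, n = 5)           # 5-row preview; the exact preview call isn't shown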
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## # A tibble: 5 x 7
## age sex bmi children smoker region charges
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 26 female 22.6 0 no northwest 3177.
## 2 22 male 31.7 0 no northeast 2255.
## 3 32 female 37.1 3 no northeast 6334.
## 4 45 male 28.7 2 no southwest 8028.
## 5 19 female 31.8 1 no northwest 2719.
The collected data has 7 variables, each described below.
Age: Insurance contractor’s age
Sex: Insurance contractor’s gender, [female, male]
BMI: Body mass index, an objective measure of body weight relative to height (kg/m²), indicating whether weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9
Children: Number of children covered by health insurance / Number of dependents
Smoker: Smoking, [yes, no]
Region: The beneficiary’s residential area in Bangladesh [northeast, southeast, southwest, northwest]
Charges: Individual medical costs billed by health insurance
Objectives of processing this dataset
Nowadays, a basic understanding of health insurance is essential for every family. As citizens, we should know how charges are set for covering personal medical expenses, since the risk is spread over numerous persons. By estimating the overall risk of health care and health system expenses over the risk pool, an insurer can develop a routine finance structure, such as a monthly premium or payroll tax, to provide the money to pay for the health care benefits specified in the insurance agreement.
For such cases, considering both the company and the customers, we have two objectives or questions to find through this project:
1. Which factors affect the amount of health insurance charges?
2. How precisely can the company or the customer estimate the amount of insurance?
Exploratory Data Analysis
Exploratory data analysis (EDA) is a term for certain kinds of initial analysis and findings done with data sets, usually early on in an analytical process. Some experts describe it as “taking a peek” at the data to understand more about what it represents and how to apply it. It is often a precursor to other kinds of work with statistics and data.
Importing Libraries
Before starting the data analysis and modelling, we imported all the necessary libraries for visualisation as well as for building the models.
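The import chunk itself is hidden in the knitted output; judging from the loading messages below and the functions used later, it would look roughly like this (ggplot2 and cowplot are inferred from the plotting code; lattice, survival and Formula are loaded automatically as dependencies of Hmisc):
library(ggplot2)   # inferred from the ggplot() calls below
library(cowplot)   # inferred: plot_grid(), ggdraw(), draw_label()
library(Hmisc)     # describe(); pulls in lattice, survival and Formula
library(wrapr)
library(e1071)
library(caret)     # createDataPartition() for the train/test split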
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: wrapr
##
## Attaching package: 'wrapr'
## The following object is masked from 'package:dplyr':
##
## coalesce
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
To get a quick overview of the dataset, we use the describe() function from Hmisc, which reports the mean, the lowest and highest values, and the number of missing values for each variable.
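The call itself is hidden in the output; it amounts to:
describe(Data)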
## Data
##
## 7 Variables 1418 Observations
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 1374 44 51 0.999 38.66 16.67 18 19
## .25 .50 .75 .90 .95
## 26 39 51 59 61
##
## lowest : 0 18 19 20 21, highest: 60 61 62 63 64
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 1412 6 2
##
## Value female male
## Frequency 694 718
## Proportion 0.492 0.508
## --------------------------------------------------------------------------------
## bmi
## n missing distinct Info Mean Gmd .05 .10
## 1359 59 563 1 30.55 7.003 20.78 22.70
## .25 .50 .75 .90 .95
## 26.12 30.30 34.60 38.60 40.96
##
## lowest : 0.000 10.000 15.090 15.960 16.815, highest: 48.070 49.060 50.380 52.580 53.130
## --------------------------------------------------------------------------------
## children
## n missing distinct Info Mean Gmd
## 1415 3 6 0.891 1.063 1.266
##
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##
## Value 0 1 2 3 4 5
## Frequency 633 329 247 162 25 19
## Proportion 0.447 0.233 0.175 0.114 0.018 0.013
## --------------------------------------------------------------------------------
## smoker
## n missing distinct
## 1412 6 2
##
## Value no yes
## Frequency 1096 316
## Proportion 0.776 0.224
## --------------------------------------------------------------------------------
## region
## n missing distinct
## 1364 54 4
##
## Value northeast northwest southeast southwest
## Frequency 334 327 374 329
## Proportion 0.245 0.240 0.274 0.241
## --------------------------------------------------------------------------------
## charges
## n missing distinct Info Mean Gmd .05 .10
## 1357 61 1355 1 13169 12277 1729 2239
## .25 .50 .75 .90 .95
## 4671 9288 16456 34790 41047
##
## lowest : 160.30 250.50 403.93 890.03 1020.34
## highest: 55135.40 58571.07 60021.40 62592.87 63770.43
## --------------------------------------------------------------------------------
Data Processing
To prepare the dataset for analysis, we have to check and correct every variable and its contents. The description above shows the data is a little untidy, with a couple of missing values in every variable. Summing is.na() over the dataset gives the total number of missing values.
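The check is a one-liner (the chunk is hidden in the output; Data is the raw tibble):
sum(is.na(Data))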
## [1] 233
Using the na.omit() function, we removed every row containing a missing value and stored the result in a new dataset named Data1.
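Reconstructed from the description above, the code is:
Data1 <- na.omit(Data)
sum(is.na(Data1))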
## [1] 0
With no missing values left in the dataset, we moved on to the next step: the analysis.
Correlation Analysis
x <- ggplot(Data1, aes(age, charges)) +
  geom_jitter(color = "blue", alpha = 0.5) +
  theme_light()
y <- ggplot(Data1, aes(bmi, charges)) +
  geom_jitter(color = "green", alpha = 0.5) +
  theme_light()
p <- plot_grid(x, y)
title <- ggdraw() + draw_label("1. Correlation between Charges and Age / BMI", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))

x <- ggplot(Data1, aes(sex, charges)) +
  geom_jitter(aes(color = sex), alpha = 0.7) +
  theme_light()
y <- ggplot(Data1, aes(children, charges)) +
  geom_jitter(aes(color = children), alpha = 0.7) +
  theme_light()
p <- plot_grid(x, y)
title <- ggdraw() + draw_label("2. Correlation between Charges and Sex / Children covered by insurance", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))

x <- ggplot(Data1, aes(smoker, charges)) +
  geom_jitter(aes(color = smoker), alpha = 0.7) +
  theme_light()
y <- ggplot(Data1, aes(region, charges)) +
  geom_jitter(aes(color = region), alpha = 0.7) +
  theme_light()
p <- plot_grid(x, y)
title <- ggdraw() + draw_label("3. Correlation between Charges and Smoker / Region", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))

Insights
Plot 1: As Age and BMI go up, Charges for health insurance also trend up.
Plot 2: No obvious connection between Charges and Sex. Charges for policies covering 4-5 children seem to go down, which is counter-intuitive and probably reflects how few such cases there are (25 and 19 observations).
Plot 3: Charges for Smokers are higher than for non-smokers (no surprise here). No obvious connection between Charges and Region.
Preparing the dataset for modeling
This is the part where we split the data into training and testing datasets: 20% of the data for testing and 80% for training the model.
After splitting the data, we created a formula for our model that relates all the explanatory variables to the label variable, charges, as sketched below.
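Neither the split nor the formula appears in the knitted output; a minimal sketch, assuming caret's createDataPartition (the seed is our own choice, not taken from the report):
set.seed(42)  # hypothetical seed, not shown in the report
train_index <- createDataPartition(Data1$charges, p = 0.8, list = FALSE)
Data_train <- Data1[train_index, ]
Data_test  <- Data1[-train_index, ]

# All explanatory variables against the label, charges
formula_0 <- as.formula("charges ~ age + sex + bmi + children + smoker + region")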
Linear Regression Model
If the goal is prediction, forecasting, or error reduction, linear regression can fit a predictive model to an observed data set of response and explanatory variables. Here the independent variables are age, bmi, children, and so on, and the dependent one is charges.
We fitted the first linear regression model by applying the first formula to the training dataset.
First Model
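The fitting call is hidden in the knitted output; reconstructed from the Call line below, it is:
model_0 <- lm(formula_0, data = Data_train)
summary(model_0)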
##
## Call:
## lm(formula = formula_0, data = Data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23051.5 -2952.4 -864.8 1619.1 30550.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13139.13 1079.71 -12.169 < 2e-16 ***
## age 256.52 13.35 19.220 < 2e-16 ***
## sexmale 86.90 372.50 0.233 0.81557
## bmi 365.24 31.98 11.421 < 2e-16 ***
## children 455.27 156.60 2.907 0.00372 **
## smokeryes 23317.23 460.74 50.609 < 2e-16 ***
## regionnorthwest 164.33 536.20 0.306 0.75930
## regionsoutheast -855.06 532.43 -1.606 0.10858
## regionsouthwest -544.89 532.27 -1.024 0.30620
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1073 degrees of freedom
## Multiple R-squared: 0.7431, Adjusted R-squared: 0.7412
## F-statistic: 387.9 on 8 and 1073 DF, p-value: < 2.2e-16
# Saving the R-squared
r_sq_0 <- summary(model_0)$r.squared
# Predicting on the test set
prediction_0 <- predict(model_0, newdata = Data_test)
# Calculating the residuals
residuals_0 <- Data_test$charges - prediction_0
# Calculating the Root Mean Squared Error
rmse_0 <- sqrt(mean(residuals_0^2))

As we can see, the summary of the first model shows that some variables (sex) are not significant, while smoking has a huge influence on charges. So we built another model without the non-significant variable and checked whether the performance improved.
Second Model
We created a second formula that drops the non-significant variable (sex) and fitted the second model on the training dataset.
formula_1 <- as.formula("charges ~ age + bmi + children + smoker + region")
model_1 <- lm(formula_1, data = Data_train)
summary(model_1)
##
## Call:
## lm(formula = formula_1, data = Data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23103.0 -2955.1 -849.1 1632.7 30589.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13104.09 1068.74 -12.261 < 2e-16 ***
## age 256.43 13.34 19.230 < 2e-16 ***
## bmi 365.59 31.93 11.449 < 2e-16 ***
## children 455.61 156.52 2.911 0.00368 **
## smokeryes 23326.05 458.98 50.822 < 2e-16 ***
## regionnorthwest 164.83 535.96 0.308 0.75849
## regionsoutheast -856.92 532.14 -1.610 0.10762
## regionsouthwest -542.73 531.95 -1.020 0.30784
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6092 on 1074 degrees of freedom
## Multiple R-squared: 0.7431, Adjusted R-squared: 0.7414
## F-statistic: 443.7 on 7 and 1074 DF, p-value: < 2.2e-16
With the second model, we can see a slight improvement in the Adjusted R-squared.
Generalized Linear Model
After completing the second model, we tried a different fitting function, glm(). Note that with its default Gaussian family, glm() fits the same kind of linear model as lm() rather than a logistic regression, since charges is continuous. For the third model we used a new formula that also drops region, another non-significant variable.
Third Model
formula_2 <- as.formula("charges ~ age + bmi + children + smoker")
model_2<-glm(formula_2, data = Data_train)
summary(model_2)
##
## Call:
## glm(formula = formula_2, data = Data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -22825.4 -2955.9 -898.7 1604.4 30124.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12909.39 1032.83 -12.499 <2e-16 ***
## age 257.13 13.34 19.275 <2e-16 ***
## bmi 347.50 30.69 11.324 <2e-16 ***
## children 468.52 156.39 2.996 0.0028 **
## smokeryes 23256.40 457.07 50.882 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 37162953)
##
## Null deviance: 1.5510e+11 on 1081 degrees of freedom
## Residual deviance: 4.0025e+10 on 1077 degrees of freedom
## AIC: 21938
##
## Number of Fisher Scoring iterations: 2
# summary.glm() has no $r.squared component; for a Gaussian glm an
# equivalent R-squared can be computed from the deviances instead
r_sq_2 <- 1 - model_2$deviance / model_2$null.deviance
# Predicting on the test set
prediction_2 <- predict(model_2, newdata = Data_test)
# Calculating the residuals
residuals_2 <- Data_test$charges - prediction_2
# Calculating the Root Mean Squared Error
rmse_2 <- sqrt(mean(residuals_2^2))

As we can see, this model does not perform as well: by dropping all the non-significant variables and keeping only the most significant ones, the prediction falls slightly short of the mark.
Fourth Model
At last, we created another model with glm(), using the second formula, which keeps region despite its non-significance.
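The fitting call, reconstructed from the Call line in the output below:
model_3 <- glm(formula_1, data = Data_train)
summary(model_3)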
##
## Call:
## glm(formula = formula_1, data = Data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -23103.0 -2955.1 -849.1 1632.7 30589.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13104.09 1068.74 -12.261 < 2e-16 ***
## age 256.43 13.34 19.230 < 2e-16 ***
## bmi 365.59 31.93 11.449 < 2e-16 ***
## children 455.61 156.52 2.911 0.00368 **
## smokeryes 23326.05 458.98 50.822 < 2e-16 ***
## regionnorthwest 164.83 535.96 0.308 0.75849
## regionsoutheast -856.92 532.14 -1.610 0.10762
## regionsouthwest -542.73 531.95 -1.020 0.30784
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 37106986)
##
## Null deviance: 1.5510e+11 on 1081 degrees of freedom
## Residual deviance: 3.9853e+10 on 1074 degrees of freedom
## AIC: 21939
##
## Number of Fisher Scoring iterations: 2
# Again using the deviance-based R-squared, since summary.glm() has no $r.squared
r_sq_3 <- 1 - model_3$deviance / model_3$null.deviance
# Predicting on the test set
prediction_3 <- predict(model_3, newdata = Data_test)
# Calculating the residuals
residuals_3 <- Data_test$charges - prediction_3
# Calculating the Root Mean Squared Error
rmse_3 <- sqrt(mean(residuals_3^2))

This model performed well compared to the third one.
Comparing all the models
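The comparison chunk is hidden in the output; a sketch of the print statements that would produce it (assuming r_sq_1 and rmse_1 were computed for model_1 in the same way as shown for the other models):
print(paste0("R-squared for first model: ", round(r_sq_0, 4)))
print(paste0("R-squared for new model: ", round(r_sq_1, 4)))
print(paste0("RMSE for first model: ", round(rmse_0, 2)))
print(paste0("RMSE for second model: ", round(rmse_1, 2)))
print(paste0("RMSE for third model: ", round(rmse_2, 2)))
print(paste0("RMSE for fourth model: ", round(rmse_3, 2)))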
## [1] "R-squared for first model:0.7431"
## [1] "R-squared for new model: 0.7431"
## [1] "RMSE for first model: 6723.15"
## [1] "RMSE for second model: 6719.7"
## [1] "RMSE for third model: 6733.75"
## [1] "RMSE for fourth model: 6719.7"
As we can see, the performances are quite similar across the models (the second and fourth are in fact the same specification fitted with lm() and glm(), hence their identical RMSE), so we kept the second one since it is a little simpler and slightly more accurate.
Data_test$prediction <- predict(model_1, newdata = Data_test)
ggplot(Data_test, aes(x = prediction, y = charges)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_abline(color = "red") +
  ggtitle("Prediction vs. Real values")

Data_test$residuals <- Data_test$charges - Data_test$prediction
ggplot(data = Data_test, aes(x = prediction, y = residuals)) +
  geom_pointrange(aes(ymin = 0, ymax = residuals), color = "blue", alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = 3, color = "red") +
  ggtitle("Residuals vs. Linear model prediction")

ggplot(Data_test, aes(x = residuals)) +
  geom_histogram(bins = 15, fill = "blue") +
  ggtitle("Histogram of residuals")

The residuals are concentrated around zero, which indicates that the model predicts quite well.
Testing the Final Model
Masrur <- data.frame(age = 19,
                     bmi = 27.9,
                     children = 0,
                     smoker = "yes",
                     region = "northwest")
print(paste0("Health care charges for Masrur: ", round(predict(model_1, Masrur), 2)))
## [1] "Health care charges for Masrur: 25458.93"

Sam <- data.frame(age = 27,
                  bmi = 24.35,
                  children = 0,
                  smoker = "no",
                  region = "southeast")
print(paste0("Health care charges for Sam: ", round(predict(model_1, Sam), 2)))
## [1] "Health care charges for Sam: 1864.75"

Lily <- data.frame(age = 30,
                   bmi = 31.2,
                   children = 0,
                   smoker = "no",
                   region = "northeast")
print(paste0("Health care charges for Lily: ", round(predict(model_1, Lily), 2)))
## [1] "Health care charges for Lily: 5995.25"
Conclusion
In conclusion, we can state that we have met our objectives throughout this project. We identified the features that play a vital role in determining the amount of health insurance charges, and built a prediction model to help both the companies and the customers. For further analysis, a dataset with more features should provide comparatively more accurate outputs.