Introduction
Rising health care costs are a major economic and public health issue worldwide: according to the World Health Organization, health care accounted for 7.9% of Europe’s gross domestic product (GDP) in 2015. In Switzerland, the health care sector contributes substantially to the national GDP; its share rose from 10.7% to 12.1% between 2010 and 2015. Moreover, because health care utilisation costs may serve as a surrogate for an individual’s health status, understanding which factors drive increases in health expenditures may provide insight into risk factors and potential starting points for preventive measures.
Several studies have addressed the prediction of health care costs, approaching the issue as either a regression problem or a classification problem (classifying costs into predefined “buckets”). Previous studies also examined a broad variety of features. The most commonly used features include different sets of demographic features, health care utilisation parameters (e.g. hospitalisation or outpatient visits), drug codes, diagnosis codes, procedure codes, various chronic disease scores and cost features.
Understanding the data
In this study, we aim to predict the health charges of Bangladeshi citizens based on important features such as their age, sex, BMI, and number of children. We approached the problem as a regression task, examining how each feature correlates with charges in order to capture different patterns in the data. Based on these characteristics, we built models in two ways: with lm() (linear regression) and with glm() (a Gaussian generalized linear model). Finally, we ran some tests on the predictions, which showed good results.
We first load the dataset and take a look at the data.
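The loading chunk is hidden in the knitted output; a minimal sketch, assuming the data ships as a CSV file (the name insurance.csv is hypothetical):
library(dplyr)
library(readr)
Data <- read_csv("insurance.csv")   # hypothetical file name; the source file isn't named in the report
slice_sample(Data, n = 5)           # 5-row preview; the exact preview call isn't shown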
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## # A tibble: 5 x 7
## age sex bmi children smoker region charges
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 26 female 22.6 0 no northwest 3177.
## 2 22 male 31.7 0 no northeast 2255.
## 3 32 female 37.1 3 no northeast 6334.
## 4 45 male 28.7 2 no southwest 8028.
## 5 19 female 31.8 1 no northwest 2719.
The collected data has 7 variables, each described below.
Age: Insurance contractor’s age
Sex: Insurance contractor’s gender, [female, male]
BMI: Body mass index, an objective measure of body weight relative to height (kg/m²), indicating whether weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9
Children: Number of children covered by health insurance / Number of dependents
Smoker: Smoking, [yes, no]
Region: The beneficiary’s residential area in Bangladesh [northeast, southeast, southwest, northwest]
Charges: Individual medical costs billed by health insurance
Objectives of processing this dataset
Nowadays, a basic understanding of health insurance is essential for every family. As citizens, we should know how charges are set for covering personal medical expenses, since the risk is spread over numerous persons. By estimating the overall risk of health care and health system expenses over the risk pool, an insurer can develop a routine finance structure, such as a monthly premium or payroll tax, to provide the money to pay for the health care benefits specified in the insurance agreement.
For such cases, considering both the company and the customers, we have two objectives or questions to find through this project:
1. Which factors affect the amount of health insurance charges?
2. How precisely can the company or the customer estimate the amount of insurance?
Exploratory Data Analysis
Exploratory data analysis (EDA) is a term for certain kinds of initial analysis and findings done with data sets, usually early on in an analytical process. Some experts describe it as “taking a peek” at the data to understand more about what it represents and how to apply it. It is often a precursor to other kinds of work with statistics and data.
Importing Libraries
Before starting the data analysis and modelling, we imported all the necessary libraries for visualisation as well as for building the models.
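The import chunk itself is hidden in the knitted output; judging from the loading messages below and the functions used later, it would look roughly like this (ggplot2 and cowplot are inferred from the plotting code; lattice, survival and Formula are loaded automatically as dependencies of Hmisc):
library(ggplot2)   # inferred from the ggplot() calls below
library(cowplot)   # inferred: plot_grid(), ggdraw(), draw_label()
library(Hmisc)     # describe(); pulls in lattice, survival and Formula
library(wrapr)
library(e1071)
library(caret)     # createDataPartition() for the train/test split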
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: wrapr
##
## Attaching package: 'wrapr'
## The following object is masked from 'package:dplyr':
##
## coalesce
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
To get a quick overview of the dataset, we use the describe() function from Hmisc, which reports the mean, the lowest and highest values, and the number of missing values for each variable.
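The call itself is hidden in the output; it amounts to:
describe(Data)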
## Data
##
## 7 Variables 1418 Observations
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 1374 44 51 0.999 38.66 16.67 18 19
## .25 .50 .75 .90 .95
## 26 39 51 59 61
##
## lowest : 0 18 19 20 21, highest: 60 61 62 63 64
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 1412 6 2
##
## Value female male
## Frequency 694 718
## Proportion 0.492 0.508
## --------------------------------------------------------------------------------
## bmi
## n missing distinct Info Mean Gmd .05 .10
## 1359 59 563 1 30.55 7.003 20.78 22.70
## .25 .50 .75 .90 .95
## 26.12 30.30 34.60 38.60 40.96
##
## lowest : 0.000 10.000 15.090 15.960 16.815, highest: 48.070 49.060 50.380 52.580 53.130
## --------------------------------------------------------------------------------
## children
## n missing distinct Info Mean Gmd
## 1415 3 6 0.891 1.063 1.266
##
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##
## Value 0 1 2 3 4 5
## Frequency 633 329 247 162 25 19
## Proportion 0.447 0.233 0.175 0.114 0.018 0.013
## --------------------------------------------------------------------------------
## smoker
## n missing distinct
## 1412 6 2
##
## Value no yes
## Frequency 1096 316
## Proportion 0.776 0.224
## --------------------------------------------------------------------------------
## region
## n missing distinct
## 1364 54 4
##
## Value northeast northwest southeast southwest
## Frequency 334 327 374 329
## Proportion 0.245 0.240 0.274 0.241
## --------------------------------------------------------------------------------
## charges
## n missing distinct Info Mean Gmd .05 .10
## 1357 61 1355 1 13169 12277 1729 2239
## .25 .50 .75 .90 .95
## 4671 9288 16456 34790 41047
##
## lowest : 160.30 250.50 403.93 890.03 1020.34
## highest: 55135.40 58571.07 60021.40 62592.87 63770.43
## --------------------------------------------------------------------------------
Data Processing
To prepare the dataset for analysis, we have to check and correct every variable and its contents. The description above shows the data is a little untidy, with a couple of missing values in every variable. Summing is.na() over the dataset gives the total number of missing values.
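The check is a one-liner (the chunk is hidden in the output; Data is the raw tibble):
sum(is.na(Data))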
## [1] 233
Using the na.omit() function, we removed every row containing a missing value and stored the result in a new dataset named Data1.
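Reconstructed from the description above, the code is:
Data1 <- na.omit(Data)
sum(is.na(Data1))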
## [1] 0
With no missing values left in the dataset, we moved on to the next step: the analysis.
Correlation Analysis
x <- ggplot(Data1, aes(age, charges)) +
  geom_jitter(color = "blue", alpha = 0.5) +
  theme_light()
y <- ggplot(Data1, aes(bmi, charges)) +
  geom_jitter(color = "green", alpha = 0.5) +
  theme_light()
p <- plot_grid(x, y)
title <- ggdraw() + draw_label("1. Correlation between Charges and Age / BMI", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))

x <- ggplot(Data1, aes(sex, charges)) +
  geom_jitter(aes(color = sex), alpha = 0.7) +
  theme_light()
y <- ggplot(Data1, aes(children, charges)) +
  geom_jitter(aes(color = children), alpha = 0.7) +
  theme_light()
p <- plot_grid(x, y)
title <- ggdraw() + draw_label("2. Correlation between Charges and Sex / Children covered by insurance", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))

x <- ggplot(Data1, aes(smoker, charges)) +
  geom_jitter(aes(color = smoker), alpha = 0.7) +
  theme_light()
y <- ggplot(Data1, aes(region, charges)) +
  geom_jitter(aes(color = region), alpha = 0.7) +
  theme_light()
p <- plot_grid(x, y)
title <- ggdraw() + draw_label("3. Correlation between Charges and Smoker / Region", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))

Insights
Plot 1: As Age and BMI go up, Charges for health insurance also trend up.
Plot 2: No obvious connection between Charges and Sex. Charges for policies covering 4-5 children seem to go down, which is counter-intuitive and probably reflects how few such cases there are (25 and 19 observations).
Plot 3: Charges for Smokers are higher than for non-smokers (no surprise here). No obvious connection between Charges and Region.
Preparing the dataset for modeling
This is the part where we split the data into training and testing datasets: 20% of the data for testing and 80% for training the model.
After splitting the data, we created a formula for our model that relates all the explanatory variables to the label variable, charges, as sketched below.
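Neither the split nor the formula appears in the knitted output; a minimal sketch, assuming caret's createDataPartition (the seed is our own choice, not taken from the report):
set.seed(42)  # hypothetical seed, not shown in the report
train_index <- createDataPartition(Data1$charges, p = 0.8, list = FALSE)
Data_train <- Data1[train_index, ]
Data_test  <- Data1[-train_index, ]

# All explanatory variables against the label, charges
formula_0 <- as.formula("charges ~ age + sex + bmi + children + smoker + region")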
Linear Regression Model
If the goal is prediction, forecasting, or error reduction, linear regression can fit a predictive model to an observed data set of response and explanatory variables. Here the independent variables are age, bmi, children, and so on, and the dependent one is charges.
We fitted the first linear regression model by applying the first formula to the training dataset.
First Model
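The fitting call is hidden in the knitted output; reconstructed from the Call line below, it is:
model_0 <- lm(formula_0, data = Data_train)
summary(model_0)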
##
## Call:
## lm(formula = formula_0, data = Data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23051.5 -2952.4 -864.8 1619.1 30550.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13139.13 1079.71 -12.169 < 2e-16 ***
## age 256.52 13.35 19.220 < 2e-16 ***
## sexmale 86.90 372.50 0.233 0.81557
## bmi 365.24 31.98 11.421 < 2e-16 ***
## children 455.27 156.60 2.907 0.00372 **
## smokeryes 23317.23 460.74 50.609 < 2e-16 ***
## regionnorthwest 164.33 536.20 0.306 0.75930
## regionsoutheast -855.06 532.43 -1.606 0.10858
## regionsouthwest -544.89 532.27 -1.024 0.30620
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1073 degrees of freedom
## Multiple R-squared: 0.7431, Adjusted R-squared: 0.7412
## F-statistic: 387.9 on 8 and 1073 DF, p-value: < 2.2e-16
# Saving the R-squared
r_sq_0 <- summary(model_0)$r.squared
# Predicting on the test set
prediction_0 <- predict(model_0, newdata = Data_test)
# Calculating the residuals
residuals_0 <- Data_test$charges - prediction_0
# Calculating the Root Mean Squared Error
rmse_0 <- sqrt(mean(residuals_0^2))

As we can see, the summary of the first model shows that some variables (sex) are not significant, while smoking has a huge influence on charges. So we built another model without the non-significant variable and checked whether the performance improved.
Second Model
We created a second formula that drops the non-significant variable (sex) and fitted the second model on the training dataset.
formula_1 <- as.formula("charges ~ age + bmi + children + smoker + region")
model_1 <- lm(formula_1, data = Data_train)
summary(model_1)
##
## Call:
## lm(formula = formula_1, data = Data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23103.0 -2955.1 -849.1 1632.7 30589.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13104.09 1068.74 -12.261 < 2e-16 ***
## age 256.43 13.34 19.230 < 2e-16 ***
## bmi 365.59 31.93 11.449 < 2e-16 ***
## children 455.61 156.52 2.911 0.00368 **
## smokeryes 23326.05 458.98 50.822 < 2e-16 ***
## regionnorthwest 164.83 535.96 0.308 0.75849
## regionsoutheast -856.92 532.14 -1.610 0.10762
## regionsouthwest -542.73 531.95 -1.020 0.30784
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6092 on 1074 degrees of freedom
## Multiple R-squared: 0.7431, Adjusted R-squared: 0.7414
## F-statistic: 443.7 on 7 and 1074 DF, p-value: < 2.2e-16
With the second model, we can see a slight improvement in the Adjusted R-squared.
Generalized Linear Model
After completing the second model, we tried a different fitting function, glm(). Note that with its default Gaussian family, glm() fits the same kind of linear model as lm() rather than a logistic regression, since charges is continuous. For the third model we used a new formula that also drops region, another non-significant variable.
Third Model
formula_2 <- as.formula("charges ~ age + bmi + children + smoker")
model_2<-glm(formula_2, data = Data_train)
summary(model_2)
##
## Call:
## glm(formula = formula_2, data = Data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -22825.4 -2955.9 -898.7 1604.4 30124.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12909.39 1032.83 -12.499 <2e-16 ***
## age 257.13 13.34 19.275 <2e-16 ***
## bmi 347.50 30.69 11.324 <2e-16 ***
## children 468.52 156.39 2.996 0.0028 **
## smokeryes 23256.40 457.07 50.882 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 37162953)
##
## Null deviance: 1.5510e+11 on 1081 degrees of freedom
## Residual deviance: 4.0025e+10 on 1077 degrees of freedom
## AIC: 21938
##
## Number of Fisher Scoring iterations: 2
# summary.glm() has no $r.squared component; for a Gaussian glm an
# equivalent R-squared can be computed from the deviances instead
r_sq_2 <- 1 - model_2$deviance / model_2$null.deviance
# Predicting on the test set
prediction_2 <- predict(model_2, newdata = Data_test)
# Calculating the residuals
residuals_2 <- Data_test$charges - prediction_2
# Calculating the Root Mean Squared Error
rmse_2 <- sqrt(mean(residuals_2^2))

As we can see, this model does not perform as well: by dropping all the non-significant variables and keeping only the most significant ones, the prediction falls slightly short of the mark.
Fourth Model
At last, we created another model with glm(), using the second formula, which keeps region despite its non-significance.
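The fitting call, reconstructed from the Call line in the output below:
model_3 <- glm(formula_1, data = Data_train)
summary(model_3)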
##
## Call:
## glm(formula = formula_1, data = Data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -23103.0 -2955.1 -849.1 1632.7 30589.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13104.09 1068.74 -12.261 < 2e-16 ***
## age 256.43 13.34 19.230 < 2e-16 ***
## bmi 365.59 31.93 11.449 < 2e-16 ***
## children 455.61 156.52 2.911 0.00368 **
## smokeryes 23326.05 458.98 50.822 < 2e-16 ***
## regionnorthwest 164.83 535.96 0.308 0.75849
## regionsoutheast -856.92 532.14 -1.610 0.10762
## regionsouthwest -542.73 531.95 -1.020 0.30784
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 37106986)
##
## Null deviance: 1.5510e+11 on 1081 degrees of freedom
## Residual deviance: 3.9853e+10 on 1074 degrees of freedom
## AIC: 21939
##
## Number of Fisher Scoring iterations: 2
# Again using the deviance-based R-squared, since summary.glm() has no $r.squared
r_sq_3 <- 1 - model_3$deviance / model_3$null.deviance
# Predicting on the test set
prediction_3 <- predict(model_3, newdata = Data_test)
# Calculating the residuals
residuals_3 <- Data_test$charges - prediction_3
# Calculating the Root Mean Squared Error
rmse_3 <- sqrt(mean(residuals_3^2))

This model performed well compared to the third one.
Comparing all the models
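The comparison chunk is hidden in the output; a sketch of the print statements that would produce it (assuming r_sq_1 and rmse_1 were computed for model_1 in the same way as shown for the other models):
print(paste0("R-squared for first model: ", round(r_sq_0, 4)))
print(paste0("R-squared for new model: ", round(r_sq_1, 4)))
print(paste0("RMSE for first model: ", round(rmse_0, 2)))
print(paste0("RMSE for second model: ", round(rmse_1, 2)))
print(paste0("RMSE for third model: ", round(rmse_2, 2)))
print(paste0("RMSE for fourth model: ", round(rmse_3, 2)))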
## [1] "R-squared for first model:0.7431"
## [1] "R-squared for new model: 0.7431"
## [1] "RMSE for first model: 6723.15"
## [1] "RMSE for second model: 6719.7"
## [1] "RMSE for third model: 6733.75"
## [1] "RMSE for fourth model: 6719.7"
As we can see, the performances are quite similar across the models (the second and fourth are in fact the same specification fitted with lm() and glm(), hence their identical RMSE), so we kept the second one since it is a little simpler and slightly more accurate.
Data_test$prediction <- predict(model_1, newdata = Data_test)
ggplot(Data_test, aes(x = prediction, y = charges)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_abline(color = "red") +
  ggtitle("Prediction vs. Real values")

Data_test$residuals <- Data_test$charges - Data_test$prediction
ggplot(data = Data_test, aes(x = prediction, y = residuals)) +
  geom_pointrange(aes(ymin = 0, ymax = residuals), color = "blue", alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = 3, color = "red") +
  ggtitle("Residuals vs. Linear model prediction")

ggplot(Data_test, aes(x = residuals)) +
  geom_histogram(bins = 15, fill = "blue") +
  ggtitle("Histogram of residuals")

The residuals are concentrated around zero, which indicates that the model predicts quite well.
Testing the Final Model
Masrur <- data.frame(age = 19,
                     bmi = 27.9,
                     children = 0,
                     smoker = "yes",
                     region = "northwest")
print(paste0("Health care charges for Masrur: ", round(predict(model_1, Masrur), 2)))
## [1] "Health care charges for Masrur: 25458.93"

Sam <- data.frame(age = 27,
                  bmi = 24.35,
                  children = 0,
                  smoker = "no",
                  region = "southeast")
print(paste0("Health care charges for Sam: ", round(predict(model_1, Sam), 2)))
## [1] "Health care charges for Sam: 1864.75"

Lily <- data.frame(age = 30,
                   bmi = 31.2,
                   children = 0,
                   smoker = "no",
                   region = "northeast")
print(paste0("Health care charges for Lily: ", round(predict(model_1, Lily), 2)))
## [1] "Health care charges for Lily: 5995.25"
Conclusion
In conclusion, we can state that we have met our objectives throughout this project. We identified the features that play a vital role in determining the amount of health insurance charges, and built a prediction model to help both the companies and the customers. For further analysis, a dataset with more features should provide comparatively more accurate outputs.