Prediction Model of Insurance Cost
Setup
A health insurance company’s financial success relies on generating more revenue than it incurs cost on the healthcare of its policyholders. However, forecasting medical expenses is challenging due to the unpredictability of costs associated with rare conditions. This project aims to precisely predict insurance costs by analyzing individuals’ data, such as age, Body Mass Index, smoking habits, and other factors. Furthermore, we will identify the key variable that has the most significant impact on insurance costs. These predictions can be utilized to develop actuarial tables, enabling the adjustment of yearly premiums based on anticipated treatment expenses. This essentially constitutes a regression problem.
Library
For analyzing the data set and developing a prediction model the below necessary library functions are loaded.
Data set
## 'data.frame': 1070 obs. of 7 variables:
## $ age : int 37 18 23 32 58 25 36 34 53 45 ...
## $ sex : chr "male" "male" "female" "male" ...
## $ bmi : num 34.1 34.4 36.7 35.2 32.4 ...
## $ children: int 4 0 2 2 1 2 1 1 0 5 ...
## $ smoker : chr "yes" "no" "yes" "no" ...
## $ region : chr "southwest" "southeast" "northeast" "southwest" ...
## $ charges : num 40182 1137 38512 4671 13019 ...
From the data set, we have the following variables:
- age: age of primary beneficiary.
- sex: insurance contractor gender, female, male.
- bmi: Body Mass Index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg/m2) using the ratio of height to weight, ideally 18.5 to 24.9.
- children: number of children covered by health insurance / number of dependents.
- smoker: smoking or not.
- region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
- charges: individual medical costs billed by health insurance.
Data Preparation
Changing Data Type
Here, The variables are not stored in correct data type. Variables
named as sex, smoker and region
has to be converted as factor datatype. Therefore,
Converting the variables in their appropriate datatype for further
analysis.
factorCol <- c('sex', 'smoker', 'region')
for (col in factorCol) {
train[[col]] <- as.factor(train[[col]])
}
str(train)## 'data.frame': 1070 obs. of 7 variables:
## $ age : int 37 18 23 32 58 25 36 34 53 45 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 2 1 1 2 1 2 2 ...
## $ bmi : num 34.1 34.4 36.7 35.2 32.4 ...
## $ children: int 4 0 2 2 1 2 1 1 0 5 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 2 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 1 4 1 2 3 1 3 3 ...
## $ charges : num 40182 1137 38512 4671 13019 ...
Finding Duplicate Data
## age sex bmi children smoker region charges
## 268 19 male 30.59 0 no northwest 1639.563
One row of duplicate data found in the data set. It is highly unlikely to have same demographic and health characteristics including charges for two different people. So, we can remove the duplicate data from the data set.
Exploratory Data Analysis (EDA)
Descriptive Statistics
## age sex bmi children smoker
## Min. :18.00 female:543 Min. :15.96 Min. :0.00 no :850
## 1st Qu.:26.00 male :526 1st Qu.:26.32 1st Qu.:0.00 yes:219
## Median :39.00 Median :30.40 Median :1.00
## Mean :39.11 Mean :30.73 Mean :1.08
## 3rd Qu.:51.00 3rd Qu.:34.80 3rd Qu.:2.00
## Max. :64.00 Max. :53.13 Max. :5.00
## region charges
## northeast:249 Min. : 1122
## northwest:253 1st Qu.: 4747
## southeast:294 Median : 9447
## southwest:273 Mean :13212
## 3rd Qu.:16587
## Max. :63770
The data set contains data of people aged between 18 to 64 years. However, most of the people are around 39 years of age. It has almost similar number of data based on their sex and region. The bmi of the people ranges between 15.96 to 53.13. Further, we can see that there is a major difference in the number of people in the category of smoking habit. Non-smoker people are greater in number than smokers. And the medical charges also varies from $1100 to $64000.
Distribution of Charges
ggplot(train, aes(x = charges)) +
geom_density(alpha = .5) +
labs(title = "Distribution of Charges",
x = "Charges",
y = "Density") +
theme(plot.title = element_text(color="navy",
hjust = .5,
size=20),
axis.title.x = element_text(size=15),
axis.title.y = element_text(size=15),
axis.text.x = element_text(size=12),
axis.text.y = element_text(size=12),
)From the above distribution plot, we can see that, medical charges of
most of the people lies between $1000 to $10000. The distribution
declines toward right side which means that only very few people has
higher medical charges and among them most of their medical charges lies
in around $40000. To identify any categorical significance for this kind
of variation of distribution in charges, we have generated
boxplots based on different categorical features.
Box-plots
for (col in c('sex', 'region', 'children', 'smoker')) {
boxplot <- ggplot(train, aes(x = !!sym(col), y = charges, group = !!sym(col), colour = !!sym(col))) +
geom_boxplot(size = .8, show.legend = FALSE) +
ggtitle(glue::glue("Boxplot of Medical Charges as per {col}")) +
theme(plot.title = element_text(color = "navy", hjust = 0.5, size = 20),
axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.title = element_blank(),
legend.text = element_text(size = 12))
print(boxplot)
}Here, by looking into the boxplots we can see that there
isn’t any significant difference in the charges of sex and
region category. In the children category, we
can see a slight increase of charges with the increase of number of
children they have. But in the category of smoker , the
plot shows a significant difference. People who smoke has greater
medical charges compared to people who don’t smoke.
Distribution of Charges By Smoking Habit
ggplot(train, aes(x = charges, fill = smoker)) +
geom_density(alpha = .5) +
labs(title = "Distribution of Charges by Smoking Habit",
x = "Charges",
y = "Density") +
theme(plot.title = element_text(color="navy",
hjust = .5,
size=20),
axis.title.x = element_text(size=15),
axis.title.y = element_text(size=15),
axis.text.x = element_text(size=12),
axis.text.y = element_text(size=12),
)The graph clearly shows that smokers has greater medical charges compared to non-smokers.
Medical Charges With Age, BMI and Children in The Category of Smokers
for (col in c('age', 'bmi', 'children')) {
scplot <- ggplot(train, aes(x = !!sym(col), y = charges, group = smoker, fill = smoker, colour = smoker)) +
geom_jitter() +
geom_smooth(method = 'lm', se = FALSE, linewidth = .8) +
ggtitle(glue::glue("Medical Charges by {col} in Smoker Category")) +
theme(plot.title = element_text(color = "navy", hjust = 0.5, size = 20),
axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.title = element_blank(),
legend.text = element_text(size = 12))
print(scplot)
}From the above plots we can see that medical charge of smokers are always higher than non-smokers irrespective to their age, BMI and number of children they have. And there is a positive trend showing that medical charges increases as age and BMI increases. However, there is no significant uptrend in medical charges between smokers and non-smokers who have children. But, people with more children tends to smoke less.
Correlation Heat Map
The Correlation map illustrated above shows significant correlation
between smoker and charges . The relationship
between other features are not as significant as these two features.
However, After smoker, there is some degree of correlation
between age and charges.
Linear Regression Analysis
Step 01: Exploring The Models
## Start: AIC=18667.88
## charges ~ age + sex + bmi + children + smoker + region
##
## Df Sum of Sq RSS AIC
## - region 3 1.3678e+08 4.0476e+10 18666
## - sex 1 4.3071e+04 4.0339e+10 18666
## <none> 4.0339e+10 18668
## - children 1 2.9468e+08 4.0633e+10 18674
## - bmi 1 4.1333e+09 4.4472e+10 18770
## - age 1 1.3330e+10 5.3669e+10 18971
## - smoker 1 9.5616e+10 1.3596e+11 19965
##
## Step: AIC=18665.5
## charges ~ age + sex + bmi + children + smoker
##
## Df Sum of Sq RSS AIC
## - sex 1 1.2092e+05 4.0476e+10 18664
## <none> 4.0476e+10 18666
## - children 1 2.8813e+08 4.0764e+10 18671
## - bmi 1 4.1471e+09 4.4623e+10 18768
## - age 1 1.3501e+10 5.3976e+10 18971
## - smoker 1 9.6274e+10 1.3675e+11 19965
##
## Step: AIC=18663.5
## charges ~ age + bmi + children + smoker
##
## Df Sum of Sq RSS AIC
## <none> 4.0476e+10 18664
## - children 1 2.8803e+08 4.0764e+10 18669
## - bmi 1 4.1517e+09 4.4627e+10 18766
## - age 1 1.3509e+10 5.3984e+10 18969
## - smoker 1 9.6514e+10 1.3699e+11 19965
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker, data = train)
##
## Coefficients:
## (Intercept) age bmi children smokeryes
## -11905.1 254.9 320.6 429.9 23586.1
Here, we used step() function to identify the
best fit model by looking at the lower AIC score. The best fit model is
saved in train_lm.
Step 03: Evaluation of The Model Performance
mae <- MAE(test$charges, pred_test)
rmse <- RMSE(test$charges, pred_test)
cat("Mean Absolute Error (MAE):", mae, "\n")## Mean Absolute Error (MAE): 3941.464
## Root Mean Squared Error (RMSE): 5672.102
Mean Absolute Error (MAE): The MAE shows the average absolute difference between the actual and predicted values. Lower value of MAE indicates better model performance. Here, The MAE value derived from the model is 3941.464. It means that, on average, the model’s predictions on medical charges differs with the actual medical charges by approximately $3941.464.
Root Mean Squared Error (RMSE): The RMSE shows the square root of the average of the squared differences between the actual and predicted values. Just like MAE, The lower RMSE value indicates better model performance. Here, The RMSE value derived from the model is 5672.102. It means that, on average, the model’s predictions on medical charges deviate by $5672.102 from the actual medical charges.
Step 04: Evaluate The Model
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11734 -2983 -1004 1356 29708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11905.11 1060.63 -11.225 < 2e-16 ***
## age 254.87 13.52 18.844 < 2e-16 ***
## bmi 320.64 30.69 10.447 < 2e-16 ***
## children 429.86 156.22 2.752 0.00603 **
## smokeryes 23586.13 468.26 50.370 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6168 on 1064 degrees of freedom
## Multiple R-squared: 0.7359, Adjusted R-squared: 0.7349
## F-statistic: 741.3 on 4 and 1064 DF, p-value: < 2.2e-16
The summary of the model gives details about the coefficients and statistical significance of each predictor:
The intercept is estimated at -11905.11 meaning that Even the other features remain zero, there will be medical charge if -$11905.11 present which is not practical. However, In this case, a negative intercept doesn’t show a meaningful interpretation as charges can not be in negative form.
For every unit of increase in age will increase the medical charges by $254.87 approx.
For every unit of increase in BMI will result in an increase in medical charges by $320.64 approx.
For every unit of increase in child will raise the medical charges by $429.86 approx.
A smoker will have additional $23586.13 medical charges compared to a non-smoker.
The R-squared value of 0.7359 refers that the model
explains significant portion of 73.59% of the variance in the medical
charges.
Step 05: Checking The Validity of Linear Model Assumptions
Testing Linearity:
To identify any linear relationship between predictors and
charges taking the following hypothesis.
Null Hypothesis H0: The predictors does not correlate with
chargesAlternate Hypothesis H1: The predictors correlate with
charges
train_lin_test <- train %>% select(age, bmi, children, smoker) %>% mutate_all(~as.numeric(as.factor(.)))
train_lin_test$charges <- train$charges
for (col in c('age', 'bmi', 'children', 'smoker')) {
corr <- cor.test(train_lin_test[[col]], train_lin_test$charges)
print(round(corr$p.value, 4))
}## [1] 0
## [1] 0
## [1] 0.0062
## [1] 0
Here, the p-value of predictors and charges are below
alpha 0.05. So, we can reject the null hypothesis and
accept the alternate hypothesis stating that the predictors are
correlated with charges variable.
Residual Normality:
residuals <- residuals(train_lm)
ggplot(train_lm, aes(x = residuals)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(title = "Histograms of Residuals from Linear Regression Model,",
x = "Residuals",
y = "Frequency") +
theme(plot.title = element_text(color="navy",
hjust = .5,
size=20),
axis.title.x = element_text(size=15),
axis.title.y = element_text(size=15),
axis.text.x = element_text(size=12),
axis.text.y = element_text(size=12),
)## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the above histogram, a normal distribution of residuals is clearly visible near zero.
Checking Homoscedasticity:
In the Above plot, we can see that the spread of predicted values are fairly uniform. Hence, homoscedasticity is present in the linear regression model.