Insurance Cost Prediction Using Regression Analysis

This analysis seeks to understand the relationship between patient variables (age, sex, number of children, smoker or not, region, etc.) and the cost to the patient that is billed by insurance. Additionally, this analysis will understand if the cost billed by insurance can be predicted using the aforementioned variables.

Load the necessary libraries and read in the data set:

# load libraries
library(dplyr)
library(ggplot2)
library(broom)
library(grid)
library(gridExtra)
library(MASS)

# read data
insurance <- read.csv('insurance.csv')

# verify data
head(insurance)

Distribution of target variable and numeric variables:

# 2 by 2 grid for plots
par(mfrow = c(2, 2))

# histogram
hist(insurance$charges,
     main='Distribution of Insurance Charges',
     xlab='Insurance Charge Amount',
     col='lightblue2'
     )

# histogram
hist(insurance$age,
     main='Distribution of Patient Age',
     xlab='Age',
     col='lightblue2'
     )

# histogram
hist(insurance$bmi,
     main='Distribution of Patient BMI',
     xlab='BMI',
     col='lightblue2'
     )

# bar chart
barplot(table(insurance$children),
        main='Distribution of Number of Children',
        xlab='Number of Children',
        col='springgreen3'
        )

The distribution of insurance charge amount is right-skewed, with a majority of charges ranging between $0 and $15,000. It may be necessary to apply a log transformation to the target variable to obtain a more normal distribution.

The distribution of patient BMI is relatively normally distributed, with a slight right skew. Most patients’ BMI appear to range from 25 to 35.

Most patients in the data set do not have children. There are a few hundred patients that have 1, 2, or 3 children, and approximately 100 or less that have 4 or 5 children.

Multivariate analysis: Distribution of cost between categorical variables (box-plot):

# 2 by 2 grid for plots
par(mfrow = c(2, 2))

# box plot
boxplot(charges ~ sex, data=insurance, main="Dist. of Insurance Charge by Sex",
   col='darkorange')

# box plot
boxplot(charges ~ smoker, data=insurance, main="Dist. of Insurance Charge by Smoker (Y/N)",
   col='darkorange')

# box plot
boxplot(charges ~ region, data=insurance, main="Dist. of Insurance Charge by Region",
   col='darkorange')

# box plot
boxplot(charges ~ children, data=insurance, main="Dist. of Insurance Charge by Number of Children",
   col='darkorange')

The median charge between female and male is approximately equal, at about $10,000. The IQR for male charges has a wider range, meaning the distribution is slightly less skewed compared to female charges.

There is a significant difference in the distribution between smoking and non-smoking patients. The median charge for non-smoking patients is approximately $10,000, while the median charge for smoking patients is approximately $32,000.

The distribution of charges for each region is fairly similar, with a median of approximately $10,000. The IQR is slightly narrower for northwest and southwest regions.

The median charge is approximately the same for patients with any number of children. The IQR range appears slighly wider for patients with 2 or 3 children.

Multivariate analysis: Correlation heatmap:

# correlation matrix for numeric variables
round(cor(insurance[, c('charges', 'age', 'bmi', 'children')]), 2)

##          charges  age  bmi children
## charges     1.00 0.30 0.20     0.07
## age         0.30 1.00 0.11     0.04
## bmi         0.20 0.11 1.00     0.01
## children    0.07 0.04 0.01     1.00

There is not a strong correlation between the amount billed by insurance and either BMI or number of children. There is some moderate positive correlation between age and charges that may be usable in a model.

Multivariate analysis: Scatterplot

# scatterplot to view relationship between 2 numeric variables
age_scatter <- ggplot(insurance, aes(x=age, y=charges, col=smoker)) + 
  geom_point() + 
  ggtitle('Charges vs. Age vs. Smoker')

# scatterplot to view relationship between 2 numeric variables
bmi_scatter <- ggplot(insurance, aes(x=bmi, y=charges, col=smoker)) + 
  geom_point() + 
  ggtitle('Charges vs. BMI vs. Smoker')

# for 2 plots
grid.arrange(age_scatter, bmi_scatter, nrow=2, ncol=1, top="Multivariate Analysis")

Interestingly, there is a strong correlation between charges and age when the data is separated by smoking / non-smoking patients. This is indicative of an interaction between smoking and BMI, and also smoking and age. Therefore, interaction variables should be tested in the model.

There appears to be moderate correlation between BMI and charges after the data is partitioned by smoker.

Data transformation: 1 hot encode categorical variables and standardize numeric variables:

# one hot encoding
ohe_smoker <- c()
  
for (i in 1:nrow(insurance)) {
  if (insurance$smoker[i] == 'yes') {
    ohe_smoker <- c(ohe_smoker, 1)
  } else {
    ohe_smoker <- c(ohe_smoker, 0)
  }
}

# assign back to dataframe
insurance$smoker_binary <- ohe_smoker

# verify value counts
print(table(insurance$smoker))

## 
##   no  yes 
## 1064  274

print(table(insurance$smoker_binary))

## 
##    0    1 
## 1064  274

The number of smokers and non-smokers is correct between the columns in the data set.

Standardize numeric columns for feature scaling:

# standardization
insurance$std_age <- scale(insurance$age)
insurance$std_bmi <- scale(insurance$bmi)

Split into train and test data sets:

# train with 1,000 observations
train <- insurance[1:1000,]

# test on remaining 338 observations
test <- insurance[1001:nrow(insurance),]

Create multiple regression model:

# multiple regression
fit <- lm(log(charges) ~ smoker_binary * std_bmi + smoker_binary * std_age, data=train)

# get model statistics
summary(fit)

## 
## Call:
## lm(formula = log(charges) ~ smoker_binary * std_bmi + smoker_binary * 
##     std_age, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82257 -0.18745 -0.06226  0.08282  2.29754 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            8.7639839  0.0139923 626.345   <2e-16 ***
## smoker_binary          1.5371166  0.0315820  48.671   <2e-16 ***
## std_bmi                0.0002989  0.0143913   0.021    0.983    
## std_age                0.6083803  0.0138916  43.795   <2e-16 ***
## smoker_binary:std_bmi  0.2977196  0.0308845   9.640   <2e-16 ***
## smoker_binary:std_age -0.4793946  0.0322425 -14.868   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3963 on 994 degrees of freedom
## Multiple R-squared:  0.8165, Adjusted R-squared:  0.8156 
## F-statistic: 884.4 on 5 and 994 DF,  p-value: < 2.2e-16

The model itself is significant, due to the p-value of the F-statistic being extremely low. Additionally, each coefficient in the model is significant (except for std_bmi), due to the low p-values. The adjusted R-Squared indicates that ~81% of the variance in the dependent variable is explained by the independent variables.

# get AIC
glance(fit)

The AIC is approximately 994.7, which is much lower than the AIC of a linear model which does not use a transformed target variable.

Check linear regression assumptions: normally distributed residuals and homoscedasticity of residuals

# distribution of residuals
resid_hist <- ggplot(fit, aes(x=fit$residuals)) + geom_histogram()

# variance of residuals
resid_fitted <- ggplot(fit, aes(x=fit$fitted.values, y=fit$residuals)) + 
  geom_point() + 
  ggtitle('Residuals vs. Fitted Values')

# for 2 plots
grid.arrange(resid_hist, resid_fitted, nrow=2, ncol=1, top="Regression Assumptions")

The residuals are somewhat normally distributed, with long tails on either side, indicating a few observations in which the model did not predict accurately. There appears to be heteroskedasticity in the residuals, as the variance appears to decrease as the fitted values increase. Additional data transformations may be considered to reduce heteroskedasticity.

Make predictions and get error metric MAE:

# make predictions using test data
predictions <- predict(fit, test)

# get inverse of log predictions
inv_log_preds <- exp(predictions)

# get error metric (Mean Absolute  Error)
print('Mean Absolute Error (MAE):')

## [1] "Mean Absolute Error (MAE):"

print(mean(abs(inv_log_preds - test$charges)))

## [1] 3123.504

On average, the model error for the test data set was about $3,124.

Conclusion:

Overall, the multiple regression model is significant, and has significant coefficients, with an adjusted R-squared value of 0.81, and a MAE of 3124. Additionally, the regression assumptions were somewhat met, but the residual variance could potentially be improved. The recommendation would be to explore adding additional data to the model, such as ‘visit type’ to indicate what medical services were being billed. This may improve model accuracy, instead of trying to predict solely on patient demographics.