1 Introduction
The dataset used in this R Markdown is downloaded from https://www.kaggle.com/mirichoi0218/insurance/version/1. It consists of 1,338 observations with the following variables, as described on the source website:
age: age of primary beneficiary
sex: insurance contractor gender, female, male
bmi: body mass index (kg/m^2), an objective index of body weight relative to height; the ideal range is 18.5 to 24.9
children: Number of children covered by health insurance / Number of dependents
smoker: smoking status (yes/no)
region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance
The goal is to predict insurance charges using linear regression.
2 Read Data
Load the required libraries, read the dataset, and check the data types.
# Packages used in this analysis; RMSE() is assumed to come from caret
library(dplyr)    # glimpse(), mutate_if()
library(magrittr) # the %<>% assignment pipe
library(GGally)   # ggcorr()
library(caret)    # RMSE()
library(lmtest)   # bptest()
library(car)      # vif()

insurance <- read.csv("data/insurance.csv")
glimpse(insurance)
## Rows: 1,338
## Columns: 7
## $ age <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
## $ sex <chr> "female", "male", "male", "male", "male", "female", "female",~
## $ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
## $ region <chr> "southwest", "southeast", "southeast", "northwest", "northwes~
## $ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
3 Data Wrangling
Convert character columns to factors.
insurance %<>% mutate_if(is.character, as.factor)
glimpse(insurance)
## Rows: 1,338
## Columns: 7
## $ age <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
## $ sex <fct> female, male, male, male, male, female, female, female, male,~
## $ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker <fct> yes, no, no, no, no, no, no, no, no, no, no, yes, no, no, yes~
## $ region <fct> southwest, southeast, southeast, northwest, northwest, southe~
## $ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
Check whether any NA values exist.
anyNA(insurance)
## [1] FALSE
4 EDA
Check summary.
summary(insurance)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770  
Check for outliers in charges.
boxplot(insurance$charges)
hist(insurance$charges)
Remove outliers using the IQR rule: drop observations with charges above Q3 + 1.5 × IQR.
q.charges <- quantile(insurance$charges)  # 0%, 25%, 50%, 75%, 100%
insurance_clean <- insurance %>%
  filter(charges <= q.charges[4] + 1.5 * (q.charges[4] - q.charges[2]))
dataset <- list()
dataset[[1]] <- insurance        # original data, outliers kept
dataset[[2]] <- insurance_clean  # outliers removed
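As a quick sanity check (a minimal sketch using the quantities computed above), we can verify the share of observations dropped by the IQR rule:
# Share of observations above the upper IQR fence for charges
iqr.charges <- q.charges[4] - q.charges[2]      # IQR = Q3 - Q1
upper.fence <- q.charges[4] + 1.5 * iqr.charges
mean(insurance$charges > upper.fence)           # proportion flagged, roughly 10%
nrow(insurance) - nrow(insurance_clean)         # number of rows removed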
Analyze the correlation between charges and the other variables.
ggcorr(insurance, label = T)
## Warning in ggcorr(insurance, label = T): data in column(s) 'sex', 'smoker',
## 'region' are not numeric and were ignored
plot(insurance$sex, insurance$charges)
plot(insurance$smoker, insurance$charges)
plot(insurance$region, insurance$charges)
Analysis:
- About 10% of observations are outliers with respect to charges (see the check above); both versions of the dataset are kept so the models can be compared later.
- Among the numeric variables, age shows the strongest correlation with charges, while among the categorical variables only smoker shows a clear relationship with charges (see the quick check below).
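Both claims can be confirmed numerically; a minimal sketch using base R on the data loaded above:
# Correlation of each numeric predictor with charges
cor(insurance[, c("age", "bmi", "children")], insurance$charges)
# Mean charges by smoking status, the gap visible in the boxplot above
tapply(insurance$charges, insurance$smoker, mean)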
5 Modelling
5.1 Manual Feature Selection
Create linear regression models using the age and smoker predictors, one for each dataset.
model <- list()
for (item in dataset) {
  model[[length(model) + 1]] <- lm(charges ~ age + smoker, item)
}
Evaluate the models.
i <- 0
for (item in model) {
  i <- i + 1
  print(paste("Model No.", i))
  print("------------------")
  print(paste("Adj R Squared:", summary(item)$adj.r.squared))
  print(paste("RMSE:", RMSE(item$fitted.values, dataset[[i]]$charges)))
  print(paste("Shapiro-Wilk Test P-Value:", shapiro.test(item$residuals)$p.value))
  print(paste("Breusch-Pagan Test P-Value:", bptest(item)$p.value))
  print(paste("VIF:", vif(item)))
  print("------------------")
}
## [1] "Model No. 1"
## [1] "------------------"
## [1] "Adj R Squared: 0.720983449018106"
## [1] "RMSE: 6389.57695722642"
## [1] "Shapiro-Wilk Test P-Value: 9.86277258793249e-37"
## [1] "Breusch-Pagan Test P-Value: 3.64483356704953e-54"
## [1] "VIF: 1.00062632997212" "VIF: 1.00062632997212"
## [1] "------------------"
## [1] "Model No. 2"
## [1] "------------------"
## [1] "Adj R Squared: 0.591731480145691"
## [1] "RMSE: 4621.0108443498"
## [1] "Shapiro-Wilk Test P-Value: 1.94644919027387e-48"
## [1] "Breusch-Pagan Test P-Value: 0.0182165772533092"
## [1] "VIF: 1.00456224187403" "VIF: 1.00456224187403"
## [1] "------------------"
The results show that the model built on the dataset with outliers is better in terms of adjusted R-squared. However, both models fail the Shapiro-Wilk and Breusch-Pagan tests at the 0.05 level, so neither has normally distributed nor homoscedastic residuals.
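A common remedy for right-skewed, heteroscedastic cost data is to model log(charges) instead of charges; the sketch below is only an illustration of that idea and is not part of the original analysis:
# Hypothetical variant: refit the age + smoker model on log-transformed charges
model_log <- lm(log(charges) ~ age + smoker, insurance)
shapiro.test(model_log$residuals)$p.value   # re-check residual normality
bptest(model_log)$p.value                   # re-check homoscedasticity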
5.2 Step-wise Regression
Fit models using forward, backward, and both-direction step-wise selection, then evaluate them with the same metrics as before.
model_none <- lm(charges ~ 1, insurance)  # intercept-only model (lower scope bound)
model_all <- lm(charges ~ ., insurance)   # full model with all predictors (upper scope bound)
model_step <- list()
model_step[[1]] <- step(model_none, direction = "forward", trace = F, scope = list(upper = model_all, lower = model_none))
model_step[[2]] <- step(model_all, direction = "backward", trace = F)
model_step[[3]] <- step(model_all, direction = "both", trace = F, scope = list(upper = model_all))
i <- 0
for (item in model_step) {
  i <- i + 1
  print(paste("Model No.", i))
  print("------------------")
  print(paste("Adj R Squared:", summary(item)$adj.r.squared))
  print(paste("RMSE:", RMSE(item$fitted.values, insurance$charges)))
  print(paste("Shapiro-Wilk Test P-Value:", shapiro.test(item$residuals)$p.value))
  print(paste("Breusch-Pagan Test P-Value:", bptest(item)$p.value))
  print(paste("VIF:", vif(item)))
  print("------------------")
}
## [1] "Model No. 1"
## [1] "------------------"
## [1] "Adj R Squared: 0.749572742711622"
## [1] "RMSE: 6042.0332153941"
## [1] "Shapiro-Wilk Test P-Value: 8.73942161741846e-29"
## [1] "Breusch-Pagan Test P-Value: 5.13405122905244e-23"
## [1] "VIF: 1.00636872563314" "VIF: 1.01618764565747" "VIF: 1.1041974319208"
## [4] "VIF: 1.00371409080625" "VIF: 1.09886952082173" "VIF: 1"
## [7] "VIF: 1" "VIF: 1" "VIF: 1"
## [10] "VIF: 3" "VIF: 1.0031793088143" "VIF: 1.00806133030558"
## [13] "VIF: 1.05080799003472" "VIF: 1.00185532428902" "VIF: 1.01583776589367"
## [1] "------------------"
## [1] "Model No. 2"
## [1] "------------------"
## [1] "Adj R Squared: 0.749572742711622"
## [1] "RMSE: 6042.0332153941"
## [1] "Shapiro-Wilk Test P-Value: 8.73942161741812e-29"
## [1] "Breusch-Pagan Test P-Value: 5.13405122905222e-23"
## [1] "VIF: 1.01618764565747" "VIF: 1.1041974319208" "VIF: 1.00371409080625"
## [4] "VIF: 1.00636872563314" "VIF: 1.09886952082173" "VIF: 1"
## [7] "VIF: 1" "VIF: 1" "VIF: 1"
## [10] "VIF: 3" "VIF: 1.00806133030558" "VIF: 1.05080799003472"
## [13] "VIF: 1.00185532428902" "VIF: 1.0031793088143" "VIF: 1.01583776589367"
## [1] "------------------"
## [1] "Model No. 3"
## [1] "------------------"
## [1] "Adj R Squared: 0.749572742711622"
## [1] "RMSE: 6042.0332153941"
## [1] "Shapiro-Wilk Test P-Value: 8.73942161741812e-29"
## [1] "Breusch-Pagan Test P-Value: 5.13405122905222e-23"
## [1] "VIF: 1.01618764565747" "VIF: 1.1041974319208" "VIF: 1.00371409080625"
## [4] "VIF: 1.00636872563314" "VIF: 1.09886952082173" "VIF: 1"
## [7] "VIF: 1" "VIF: 1" "VIF: 1"
## [10] "VIF: 3" "VIF: 1.00806133030558" "VIF: 1.05080799003472"
## [13] "VIF: 1.00185532428902" "VIF: 1.0031793088143" "VIF: 1.01583776589367"
## [1] "------------------"
The results show that all three selection directions picked the same features, as the three models report identical metrics. Step-wise regression still produces models whose residuals are neither normally distributed nor homoscedastic.
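That the three search directions converged on the same model can be verified directly (a minimal check on the fitted objects above):
# All three step-wise searches should report the same selected formula
lapply(model_step, formula)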
6 Conclusion
- Outliers have little influence on the model, so they can be kept.
- The best model comes from step-wise regression, which explains about 75% of the variance in charges (adjusted R-squared ≈ 0.75).
- None of the models have normally distributed or homoscedastic residuals.
- No model shows multicollinearity; all VIF values are well below 10.