library(tidyverse)
library(tidymodels)
library(GGally)
library(rsample)
library(randomForest)

library(hrbrthemes)
library(wesanderson)

We’ll take a look on a data set from this Medical Cost Personal Datasets from Kaggle and perform a simple exploratory data analysis and build a simple prediction model on cost of medical treatment. Cost of treatment may depends on many factors, but we’ll use variables given in this dataset to predict it and also reveal things that tend to cause it become higher!

To get started, let’s start by reading our data into the environment:

dat <- read.csv("data/insurance.csv")
head(dat)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

dat is a dataframe with 1,338 observations and 10 variables;
- age: age of primary beneficiary
- sex: insurance contractor gender, female, male
- bmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9
- children: Number of children covered by health insurance / Number of dependents
- smoker: Smoking or not
- region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest
- charges: Individual medical costs billed by health insurance

(What we’d be most interested here, is with the variable charges, that is what we would try to predict!)

Exploratory Data Analysis

For a quick glimpse on our data’s summary statistics, we can use skim function from skimr:

skimr::skim(dat)

## Skim summary statistics
##  n obs: 1338 
##  n variables: 7 
## 
## ── Variable type:factor ────────────────────────────────────────────────────────
##  variable missing complete    n n_unique                             top_counts
##    region       0     1338 1338        4 sou: 364, nor: 325, sou: 325, nor: 324
##       sex       0     1338 1338        2              mal: 676, fem: 662, NA: 0
##    smoker       0     1338 1338        2              no: 1064, yes: 274, NA: 0
##  ordered
##    FALSE
##    FALSE
##    FALSE
## 
## ── Variable type:integer ───────────────────────────────────────────────────────
##  variable missing complete    n  mean    sd p0 p25 p50 p75 p100     hist
##       age       0     1338 1338 39.21 14.05 18  27  39  51   64 ▇▆▅▅▅▆▅▅
##  children       0     1338 1338  1.09  1.21  0   0   1   2    5 ▇▅▁▃▂▁▁▁
## 
## ── Variable type:numeric ───────────────────────────────────────────────────────
##  variable missing complete    n     mean       sd      p0     p25     p50
##       bmi       0     1338 1338    30.66     6.1    15.96   26.3    30.4 
##   charges       0     1338 1338 13270.42 12110.01 1121.87 4740.29 9382.03
##       p75     p100     hist
##     34.69    53.13 ▁▅▇▇▅▂▁▁
##  16639.91 63770.43 ▇▅▂▁▁▁▁▁

Apparently, we have no missing values on our data, so we’re good to go! The average, medical cost is USD 13,270 with a median value of USD 9382. The respondents’ gender and region of origin is evenly distributed, having age ranging from 18 to 64 years old. Non-smokers outnumber smokers 4 to 1.

We’ll also now observe the correlation between our numeric variables:

ggcorr(dat, label = T, color = "black", size = 5)+
  labs(title = "Correlation Matrix",
       subtitle = "Age, BMI & Children on Charged Medical Cost")+
  theme(plot.title = element_text(family = "Roboto Condensed", size = 19, face = "bold",vjust = 0),
        plot.subtitle = element_text(family = "Roboto Condensed", size = 16,vjust = 0))

We can see that age has the highest correlation with charges amongst our numeric variables. Another observation we can make from this plot is that none of our numeric values is highly correlated with each other, so multicollinearity wouldn’t be a problem (we would get into this later).

Now let’s take a look on another factors;

ggplot(dat, aes(charges))+
  geom_density(aes(fill = smoker), color = NA, alpha = 0.6) +
  facet_wrap(~sex)+
  theme_ipsum_rc()+
  theme(legend.position = "top")+
  scale_x_continuous(labels = scales::unit_format(suffix = "$"))+
  scale_fill_manual(values = wes_palette("Moonrise2"))+
  labs(title = "Distribution of Medical Costs for Smokers in Each Gender")

From above plot, we can see that there is no obvious pattern between gender and charged medical cost. On the other hand, the same cannot be said with smoking status. It can be clearly deceived that smokers spends a lot more in terms of medical expenses compared to non-smokers!

Another variable that I’m interested here is children. I want to discover if having children will stops someone from their smoking habit, so let’s also visualize that:

no_child <- dat %>% 
  filter(smoker == "yes" & children != 0)

dat %>% 
  filter(smoker == "yes") %>% 
  ggplot(aes(children))+
  geom_bar(fill = "#798e87", alpha = 0.6)+
  geom_bar(data = no_child, aes(children), fill = "#c27c39", alpha = 0.6)+
  theme_ipsum_rc()+
  theme(legend.position = "top")+
  scale_x_continuous(breaks = seq(0,5))+
  labs( y = "Number of Smokers",
        x = "Number of Children",
        title = "Number of Smokers with Children")

It seems that having children doesn’t always stop someone from smoking! But we can see the more children they have, they will less likely to keep your smoking habit.

Lastly, we’ll take a look how region of origin relates to charged medical costs;

dat %>% 
  mutate(region = str_to_title(region)) %>% 
  
ggplot(aes(region, charges))+
  geom_boxplot()+
  geom_jitter(aes(color = region),alpha = 0.4)+
  theme_ipsum_rc()+
  scale_y_continuous(labels = scales::unit_format(suffix = "$"))+
  scale_color_manual(values = wes_palette("Moonrise2"))+
  theme(legend.position = "none")+
  labs(title = "Distribution of Medical Costs per Region")

Based from the plot above, we can see that region of origin doesn’t have much impact with the amount of medical cost. We might consider this for later feature selection, but this time, we’ll just keep all our variables as it were.

Model

We’ll now predict the charged medical costs based on given variables on dataset. But before we build our models,let’s begin with splitting our data for training and testing our model to train and test data. The data we use is usually split into training data and test data. The training set contains a known output and the model learns on this data in order to be generalized to other data later on. We have the test dataset (or subset) in order to test our model’s prediction on this subset.

set.seed(100)
splitted <- initial_split(dat, prop = 0.8, strata = "charges")

dat_train <- training(splitted)
dat_test <- testing(splitted)

Linear Regression

model_lm <- stats::step(lm(charges ~., data = dat_train), direction = "backward", trace = 0)
summary(model_lm)

## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker, data = dat_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12891.5  -2706.3   -908.4   1510.2  28540.0 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -12095.01    1056.79 -11.445 <0.0000000000000002 ***
## age            265.67      13.17  20.178 <0.0000000000000002 ***
## bmi            314.74      30.69  10.255 <0.0000000000000002 ***
## children       391.11     153.63   2.546               0.011 *  
## smokeryes    24637.40     460.07  53.552 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6081 on 1069 degrees of freedom
## Multiple R-squared:  0.7619, Adjusted R-squared:  0.761 
## F-statistic: 855.1 on 4 and 1069 DF,  p-value: < 0.00000000000000022

By performing stepwise selection, we see that age, bmi, children and smoker were amongst the most contributive predictors. The result is actually in line with our previous boxplot of “Distribution of Medical Cost per Region”, showing that region of origin doesn’t have much impact with the amount of medical cost.

We also find that smoker has the highest coefficient among the others! Our model coefficient also indicates that on average, smokers’ medical charges is higher by about USD 24,000! So, yeah, smoking is bad for you, guys..

Linear regression is good on interpretability, but as we see, the model performance (Adjusted R-Squared) is not really good. This might also because we haven’t perform any further pre-process such as normalizing our data and stuffs. But we’ll just stick on this one for now and let’s take a look how our model perform on the test data:

select(dat_test, charges) %>%
  mutate(
    pred_lm = predict(model_lm, dat_test)
  ) %>% 
  summarise(
    mae = mae_vec(truth = charges, estimate = pred_lm),
    rsq = rsq_vec(truth = charges, estimate = pred_lm),
  )

##        mae       rsq
## 1 4251.699 0.7048348

We also found that on test data, the R-squared is declining to 0.70! Not a good thing though..

Now let’s perform the assumption checking to our linear model.

# multicollinearity
car::vif(model_lm)

## Registered S3 methods overwritten by 'car':
##   method                          from
##   influence.merMod                lme4
##   cooks.distance.influence.merMod lme4
##   dfbeta.influence.merMod         lme4
##   dfbetas.influence.merMod        lme4

##      age      bmi children   smoker 
## 1.012773 1.011056 1.002416 1.001297

# heteroscedasticity
lmtest::bptest(model_lm)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_lm
## BP = 92.33, df = 4, p-value < 0.00000000000000022

# normality test
shapiro.test(model_lm$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_lm$residuals
## W = 0.89837, p-value < 0.00000000000000022

Uh-oh.. we only pass 1 from 3 assumptions that should be passed. We fail to pass the heteroscedasticity and normality test!

Linear Regression with Scaled Data

Let’s check the distribution on charges:

boxplot(dat_train$charges, main = "Distribution of Medical Cost in Insurance Data")

This might be what caused all the fuss above. There are so many outliers found in our data! To deal with outliers, one of the most common used method is to scale our numeric data. I’m gonna use recipe from tidymodels to perform the scaling since it let us revert our function easier later:

# define preprocess recipe from train dataset
rec <- recipe(charges ~ ., data = dat_train) %>% 
  step_sqrt(all_numeric()) %>%
  step_center(all_numeric()) %>% 
  step_scale(all_numeric()) %>% 
  prep()

# prepare recipes-revert functions
rec_rev <- function(x, rec){

  means <- rec$steps[[2]]$means[["charges"]]
  sds <- rec$steps[[3]]$sds[["charges"]]

  x <- (x * sds + means) ^ 2

  x
}

# apply recipe to train data
dat_train_scaled <- juice(rec)

# apply recipe to test data
dat_test_scaled <- bake(rec, dat_test)

After we perform the scaling, let’s take a look back to charges distribution:

boxplot(dat_train_scaled$charges, main = "Distribution of Medical Cost in Insurance Data")

Way better now! Now let’s re-build our model and compare the performance:

model_lm_scaled <- stats::step(lm(charges ~., data = dat_train_scaled), direction = "backward", trace = 0)

rbind(
  # performance with scaling
 "Scaled" = select(dat_test, charges) %>%
  mutate(
    pred_lm = predict(model_lm_scaled, dat_test_scaled),
    pred_lm = rec_rev(pred_lm, rec)
  ) %>% 
  summarise(
    mae = mae_vec(truth = charges, estimate = pred_lm),
    rsq = rsq_vec(truth = charges, estimate = pred_lm),
  ),
  
  # performance without scaling
 "Not Scaled" = select(dat_test, charges) %>%
  mutate(
    pred_lm = predict(model_lm, dat_test)
  ) %>% 
  summarise(
    mae = mae_vec(truth = charges, estimate = pred_lm),
    rsq = rsq_vec(truth = charges, estimate = pred_lm),
  )
)

##                 mae       rsq
## Scaled     3672.025 0.7184345
## Not Scaled 4251.699 0.7048348

The difference on Rsquared wasn’t too significant. But we can see when we build our model from a scaled data, the MAE is lower than before.

Random Forest (also with scaled data!)

model_rf <- randomForest(
  formula =  charges ~ age + bmi + children + smoker,
  data = dat_train_scaled
)

select(dat_test, charges) %>%
  mutate(
    pred_lm = predict(model_rf, dat_test_scaled),
    pred_lm = rec_rev(pred_lm, rec)
  ) %>% 
  summarise(
    mae = mae_vec(truth = charges, estimate = pred_lm),
    rsq = rsq_vec(truth = charges, estimate = pred_lm),
  )

##        mae       rsq
## 1 3020.573 0.8145478

With randomforest, we get better performance with higher R-Squared and lower MAE. And since we’re using random forest algorithm, we can also ignore all the assumptions above.

Regression Analysis on Medical Insurance Cost

Tanesya

2/3/2020

Exploratory Data Analysis

Model

Linear Regression

Linear Regression with Scaled Data

Random Forest (also with scaled data!)