library(tidyverse)
library(tidymodels)
library(GGally)
library(rsample)
library(randomForest)
library(hrbrthemes)
library(wesanderson)We’ll take a look on a data set from this Medical Cost Personal Datasets from Kaggle and perform a simple exploratory data analysis and build a simple prediction model on cost of medical treatment. Cost of treatment may depends on many factors, but we’ll use variables given in this dataset to predict it and also reveal things that tend to cause it become higher!
To get started, let’s start by reading our data into the environment:
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
dat is a dataframe with 1,338 observations and 10 variables;
- age: age of primary beneficiary
- sex: insurance contractor gender, female, male
- bmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9
- children: Number of children covered by health insurance / Number of dependents
- smoker: Smoking or not
- region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest
- charges: Individual medical costs billed by health insurance
(What we’d be most interested here, is with the variable charges, that is what we would try to predict!)
For a quick glimpse on our data’s summary statistics, we can use skim function from skimr:
## Skim summary statistics
## n obs: 1338
## n variables: 7
##
## ── Variable type:factor ────────────────────────────────────────────────────────
## variable missing complete n n_unique top_counts
## region 0 1338 1338 4 sou: 364, nor: 325, sou: 325, nor: 324
## sex 0 1338 1338 2 mal: 676, fem: 662, NA: 0
## smoker 0 1338 1338 2 no: 1064, yes: 274, NA: 0
## ordered
## FALSE
## FALSE
## FALSE
##
## ── Variable type:integer ───────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## age 0 1338 1338 39.21 14.05 18 27 39 51 64 ▇▆▅▅▅▆▅▅
## children 0 1338 1338 1.09 1.21 0 0 1 2 5 ▇▅▁▃▂▁▁▁
##
## ── Variable type:numeric ───────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## bmi 0 1338 1338 30.66 6.1 15.96 26.3 30.4
## charges 0 1338 1338 13270.42 12110.01 1121.87 4740.29 9382.03
## p75 p100 hist
## 34.69 53.13 ▁▅▇▇▅▂▁▁
## 16639.91 63770.43 ▇▅▂▁▁▁▁▁
Apparently, we have no missing values on our data, so we’re good to go! The average, medical cost is USD 13,270 with a median value of USD 9382. The respondents’ gender and region of origin is evenly distributed, having age ranging from 18 to 64 years old. Non-smokers outnumber smokers 4 to 1.
We’ll also now observe the correlation between our numeric variables:
ggcorr(dat, label = T, color = "black", size = 5)+
labs(title = "Correlation Matrix",
subtitle = "Age, BMI & Children on Charged Medical Cost")+
theme(plot.title = element_text(family = "Roboto Condensed", size = 19, face = "bold",vjust = 0),
plot.subtitle = element_text(family = "Roboto Condensed", size = 16,vjust = 0))We can see that age has the highest correlation with charges amongst our numeric variables. Another observation we can make from this plot is that none of our numeric values is highly correlated with each other, so multicollinearity wouldn’t be a problem (we would get into this later).
Now let’s take a look on another factors;
ggplot(dat, aes(charges))+
geom_density(aes(fill = smoker), color = NA, alpha = 0.6) +
facet_wrap(~sex)+
theme_ipsum_rc()+
theme(legend.position = "top")+
scale_x_continuous(labels = scales::unit_format(suffix = "$"))+
scale_fill_manual(values = wes_palette("Moonrise2"))+
labs(title = "Distribution of Medical Costs for Smokers in Each Gender")From above plot, we can see that there is no obvious pattern between gender and charged medical cost. On the other hand, the same cannot be said with smoking status. It can be clearly deceived that smokers spends a lot more in terms of medical expenses compared to non-smokers!
Another variable that I’m interested here is children. I want to discover if having children will stops someone from their smoking habit, so let’s also visualize that:
no_child <- dat %>%
filter(smoker == "yes" & children != 0)
dat %>%
filter(smoker == "yes") %>%
ggplot(aes(children))+
geom_bar(fill = "#798e87", alpha = 0.6)+
geom_bar(data = no_child, aes(children), fill = "#c27c39", alpha = 0.6)+
theme_ipsum_rc()+
theme(legend.position = "top")+
scale_x_continuous(breaks = seq(0,5))+
labs( y = "Number of Smokers",
x = "Number of Children",
title = "Number of Smokers with Children")It seems that having children doesn’t always stop someone from smoking! But we can see the more children they have, they will less likely to keep your smoking habit.
Lastly, we’ll take a look how region of origin relates to charged medical costs;
dat %>%
mutate(region = str_to_title(region)) %>%
ggplot(aes(region, charges))+
geom_boxplot()+
geom_jitter(aes(color = region),alpha = 0.4)+
theme_ipsum_rc()+
scale_y_continuous(labels = scales::unit_format(suffix = "$"))+
scale_color_manual(values = wes_palette("Moonrise2"))+
theme(legend.position = "none")+
labs(title = "Distribution of Medical Costs per Region")Based from the plot above, we can see that region of origin doesn’t have much impact with the amount of medical cost. We might consider this for later feature selection, but this time, we’ll just keep all our variables as it were.
We’ll now predict the charged medical costs based on given variables on dataset. But before we build our models,let’s begin with splitting our data for training and testing our model to train and test data. The data we use is usually split into training data and test data. The training set contains a known output and the model learns on this data in order to be generalized to other data later on. We have the test dataset (or subset) in order to test our model’s prediction on this subset.
set.seed(100)
splitted <- initial_split(dat, prop = 0.8, strata = "charges")
dat_train <- training(splitted)
dat_test <- testing(splitted)model_lm <- stats::step(lm(charges ~., data = dat_train), direction = "backward", trace = 0)
summary(model_lm)##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker, data = dat_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12891.5 -2706.3 -908.4 1510.2 28540.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12095.01 1056.79 -11.445 <0.0000000000000002 ***
## age 265.67 13.17 20.178 <0.0000000000000002 ***
## bmi 314.74 30.69 10.255 <0.0000000000000002 ***
## children 391.11 153.63 2.546 0.011 *
## smokeryes 24637.40 460.07 53.552 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6081 on 1069 degrees of freedom
## Multiple R-squared: 0.7619, Adjusted R-squared: 0.761
## F-statistic: 855.1 on 4 and 1069 DF, p-value: < 0.00000000000000022
By performing stepwise selection, we see that age, bmi, children and smoker were amongst the most contributive predictors. The result is actually in line with our previous boxplot of “Distribution of Medical Cost per Region”, showing that region of origin doesn’t have much impact with the amount of medical cost.
We also find that smoker has the highest coefficient among the others! Our model coefficient also indicates that on average, smokers’ medical charges is higher by about USD 24,000! So, yeah, smoking is bad for you, guys..
Linear regression is good on interpretability, but as we see, the model performance (Adjusted R-Squared) is not really good. This might also because we haven’t perform any further pre-process such as normalizing our data and stuffs. But we’ll just stick on this one for now and let’s take a look how our model perform on the test data:
select(dat_test, charges) %>%
mutate(
pred_lm = predict(model_lm, dat_test)
) %>%
summarise(
mae = mae_vec(truth = charges, estimate = pred_lm),
rsq = rsq_vec(truth = charges, estimate = pred_lm),
)## mae rsq
## 1 4251.699 0.7048348
We also found that on test data, the R-squared is declining to 0.70! Not a good thing though..
Now let’s perform the assumption checking to our linear model.
## Registered S3 methods overwritten by 'car':
## method from
## influence.merMod lme4
## cooks.distance.influence.merMod lme4
## dfbeta.influence.merMod lme4
## dfbetas.influence.merMod lme4
## age bmi children smoker
## 1.012773 1.011056 1.002416 1.001297
##
## studentized Breusch-Pagan test
##
## data: model_lm
## BP = 92.33, df = 4, p-value < 0.00000000000000022
##
## Shapiro-Wilk normality test
##
## data: model_lm$residuals
## W = 0.89837, p-value < 0.00000000000000022
Uh-oh.. we only pass 1 from 3 assumptions that should be passed. We fail to pass the heteroscedasticity and normality test!
Let’s check the distribution on charges:
This might be what caused all the fuss above. There are so many outliers found in our data! To deal with outliers, one of the most common used method is to scale our numeric data. I’m gonna use recipe from tidymodels to perform the scaling since it let us revert our function easier later:
# define preprocess recipe from train dataset
rec <- recipe(charges ~ ., data = dat_train) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
# prepare recipes-revert functions
rec_rev <- function(x, rec){
means <- rec$steps[[2]]$means[["charges"]]
sds <- rec$steps[[3]]$sds[["charges"]]
x <- (x * sds + means) ^ 2
x
}
# apply recipe to train data
dat_train_scaled <- juice(rec)
# apply recipe to test data
dat_test_scaled <- bake(rec, dat_test)After we perform the scaling, let’s take a look back to charges distribution:
Way better now! Now let’s re-build our model and compare the performance:
rbind(
# performance with scaling
"Scaled" = select(dat_test, charges) %>%
mutate(
pred_lm = predict(model_lm_scaled, dat_test_scaled),
pred_lm = rec_rev(pred_lm, rec)
) %>%
summarise(
mae = mae_vec(truth = charges, estimate = pred_lm),
rsq = rsq_vec(truth = charges, estimate = pred_lm),
),
# performance without scaling
"Not Scaled" = select(dat_test, charges) %>%
mutate(
pred_lm = predict(model_lm, dat_test)
) %>%
summarise(
mae = mae_vec(truth = charges, estimate = pred_lm),
rsq = rsq_vec(truth = charges, estimate = pred_lm),
)
)## mae rsq
## Scaled 3672.025 0.7184345
## Not Scaled 4251.699 0.7048348
The difference on Rsquared wasn’t too significant. But we can see when we build our model from a scaled data, the MAE is lower than before.
model_rf <- randomForest(
formula = charges ~ age + bmi + children + smoker,
data = dat_train_scaled
)
select(dat_test, charges) %>%
mutate(
pred_lm = predict(model_rf, dat_test_scaled),
pred_lm = rec_rev(pred_lm, rec)
) %>%
summarise(
mae = mae_vec(truth = charges, estimate = pred_lm),
rsq = rsq_vec(truth = charges, estimate = pred_lm),
)## mae rsq
## 1 3020.573 0.8145478
With randomforest, we get better performance with higher R-Squared and lower MAE. And since we’re using random forest algorithm, we can also ignore all the assumptions above.