Introduction

The purpose of this analysis is to propose a ML model that can predict the medical cost that a patient may incur at a given time given certain characteristics of the patient. This notebook will only focus on using the Random Forest model by adjusting its main hyperparameters.

Data was pulled from the renowned community of data scientists Kaggle and will make use of the ecosystem of packages offered by Tidymodels.

Columns description

There are 7 columns in the data, which are described below:

  • Age: age of primary beneficiary.
  • Sex: insurance contractor gender (female or male).
  • Bmi: Body mass index.
  • Children: Number of children covered by health insurance / Number of dependents.
  • Smoker: Active smoking status.
  • Region: the beneficiary’s residential area in the US (northeast, southeast, southwest or northwest).
  • Charges: Individual medical costs billed by health insurance.

Exploratory data analysis

Before starting to establish any type of model, it is necessary to explore the data to determine if there are inconsistencies in it, such as missing values, and also to understand a little about how our response variable is affected by all the other variables.

Dataset overview

Using the skim function from the skimr package we can get an overview of all the variables in the data.

Data summary
Name insurance
Number of rows 1338
Number of columns 7
_______________________
Column type frequency:
factor 3
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 mal: 676, fem: 662
smoker 0 1 FALSE 2 no: 1064, yes: 274
region 0 1 FALSE 4 sou: 364, nor: 325, sou: 325, nor: 324

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
age 0 1 39.21 14.05 18.00 27.00 39.00 51.00 64.00
bmi 0 1 30.66 6.10 15.96 26.30 30.40 34.69 53.13
children 0 1 1.09 1.21 0.00 0.00 1.00 2.00 5.00
charges 0 1 13270.42 12110.01 1121.87 4740.29 9382.03 16639.91 63770.43

So that there are no missing values in any of the variables. On average, all the insured have at least one child, although there are many with no children, it is also observed that in general the people registered in the data suffer from obesity (bmi >= 30 is classified as obese).

Distribution of medical charges

Let us observe the empirical distribution of the medical charges incurred by the insured.

As expected in all the variables that indicate monetary amount or income, they have a bias to the right in their distribution. To normalize the data and stabilize the variance we will proceed with a logarithmic transformation.

Categorical variables

Next, the relationship between the categorical variables and the response variable is explored. The goal is to find changes in the median which will be the patterns to be captured by our model.

Sex of the insured

No relevant changes are observed due to the sex of the insured.

Active smoking status

It seems that if the insured person smokes, higher medical costs can be expected. This statement seems obvious because smoking influences health and affects the body.

Will there be any relationship between the sex of the insured person and whether or not they are an active smoker?

No interaction is observed in these variables in relation to the medical cost of the insured.

Region of residence

No relevant changes are observed in the response variable with respect to the region of residence of the insured.

Number of children

It seems that most of the insured have between 0 and 1 children in their policies but it does not seem that this is related to the medical cost that each insured incurs.

Numeric variables

Next, the relationship between the numerical variables and the response variable is explored.

Age of the insured

A positive linear relationship is observed between the age of the insured and the medical cost, however, from previous graphs it was found that the status of smoker greatly impacts the response variable.

Will there be any relationship between this and the age of the insured?

There is an interaction between these variables that demonstrates differences in the mean medical cost. It would be a good idea to use this interaction in the model.

Body mass index

Like the previous step, it seems that here we have another obvious interaction between these variables with respect to medical cost.

Split data

Training and testing sets

Performing supervised learning models using all the available data is not usually a good option because this often causes “overfitting” problems. It is better to partition the data into two sets, one to train the models and another to test the models, this will allow us to know the performance of the models on new data.

# Split 80% / 20%
set.seed(123)
split <- initial_split(data = insurance, prop = 0.80, strata = charges)

# Training set
train_set <- training(split)

# Testing set
test_set <- testing(split)

Cross validation

Apart from partitioning the data into training and test sets, the training set will be partitioned into 5 data subsets, this allows having a variety of partitions to train and measure the result of the hyperparameter adjustment process.

set.seed(123)
folds <- vfold_cv(train_set, strata = charges, v = 5)

Feature engineering

Based on what was observed in the exploratory analysis, a recipe will be created with the pre-processing using feature engineering.

Preprocessing recipe

The steps that will be included in the recipe for data pre-processing are explained below:

  • The logarithmic transformation is performed on the response variable.
  • A new categorical variable is created that indicates whether the insured can be categorized as obese according to their body mass index.
  • All predictor variables are normalized.
  • All categorical variables are converted to Dummy.
  • Finally, the interactions observed in the exploratory analysis of the data are created.
recipe <- 
  recipe(charges ~ ., data = train_set) %>% 
  step_log(all_outcomes()) %>% 
  step_mutate(is_obsessive = 
                if_else(bmi > 30, "Yes", "No") %>% 
                factor(levels = c("Yes", "No"))) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_interact(terms = ~ smoker_yes:age) %>% 
  step_interact(terms = ~ smoker_yes:bmi)

Hyperparameter tuning

We already have the recipe with the pre-processing of the data, we only need to establish the workflow of our Random Forest model and the grid with all the possible combinations of hyperparameters to adjust.

The three main hyperparameters to adjust are the following:

  • mtry: An integer for the number of predictors that will be randomly sampled at each split when creating the tree models.
  • trees: An integer for the number of trees contained in the ensemble.
  • min_n: An integer for the minimum number of data points in a node that are required for the node to be split further.

Model spec

# Random forest spec
rf_spec <- rand_forest(
  mtry = tune(), trees = tune(), min_n = tune()
  ) %>%
  set_engine(
  "ranger", num.threads = 7, importance = "impurity"
  ) %>%
  set_mode("regression")

# Ranfom forest workflow
rf_Wf <- workflow() %>% 
  add_recipe(recipe) %>%
  add_model(rf_spec)

Setting grid

A method of filling spaces will be used to establish the grid of possible values for the hyperparameters. For more information consult.

# Space-filling designs grid
rf_grid <- 
  grid_latin_hypercube(
    min_n(), 
    mtry(range = c(4, 9)), 
    trees(), 
    size = 80)

Now let’s visualize the grid.

Tuning Hyperparameters

# Lets use seven cores for this process
doParallel::registerDoParallel(cores = 7)

set.seed(123)
tune_res <- 
  rf_Wf %>% 
  tune_grid(
    resamples = folds, grid = rf_grid, 
    metrics = metric_set(rmse, mae, rsq)
    )

Let’s visualize the tuning results.

Best fives models for minimize root mean square error.

.config .metric mtry trees min_n mean
Preprocessor1_Model14 rmse 6 525 35 0.3719130
Preprocessor1_Model21 rmse 6 1621 31 0.3719945
Preprocessor1_Model47 rmse 5 613 23 0.3720179
Preprocessor1_Model04 rmse 7 1998 37 0.3720898
Preprocessor1_Model42 rmse 5 790 36 0.3723533

Best fives models to maximize coefficient of determination.

.config .metric mtry trees min_n mean
Preprocessor1_Model57 rsq 5 983 36 0.8362315
Preprocessor1_Model14 rsq 6 525 35 0.8360723
Preprocessor1_Model42 rsq 5 790 36 0.8360524
Preprocessor1_Model47 rsq 5 613 23 0.8359879
Preprocessor1_Model09 rsq 5 1356 36 0.8359692

It seems that under both approaches the same combination of parameters is obtained.

Final Model

# finalize model set up
final_rf <- rf_Wf %>% 
  finalize_workflow(
    show_best(x = tune_res, metric = "rmse", n = 1)
  )

Now lets fit the final model to testing set to see his performance.

final_rs <-
  final_rf %>%
  last_fit(
    split, metrics = metric_set(rmse, mae, rsq)
    )

The easiest way to visually assess the agreement between observed and predicted values is with a scatterplot. let’s do it.

It looks good but it seems that for some people the model tends to underestimate the values.

Variable importance determination

From the above graph it can be concluded that the age and smoking status of the insured are the main variables to predict the medical cost. In addition, it seems that the interactions between the variables are relevant and, as observed in the exploratory analysis of the data, the sex and the region where the insured is located are not relevant.