The purpose of this analysis is to propose a machine learning model that can predict the medical cost a patient may incur, given certain characteristics of that patient. This notebook focuses exclusively on the Random Forest model and on tuning its main hyperparameters.
The data was pulled from Kaggle, the well-known community of data scientists, and the analysis makes use of the ecosystem of packages offered by Tidymodels.
There are 7 columns in the data, which are described below:
Before building any model, it is necessary to explore the data to check for inconsistencies such as missing values, and to understand how the response variable relates to the other variables.
Using the skim function from the skimr package we can get an overview of all the variables in the data.
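A minimal sketch of that call, assuming the data set has already been read into a data frame named `insurance`:

```r
library(skimr)

# Overview of every column: counts, missing values, and basic statistics
skim(insurance)
```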
| Data summary | |
|---|---|
| Name | insurance |
| Number of rows | 1338 |
| Number of columns | 7 |
| Column type frequency: | |
| factor | 3 |
| numeric | 4 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1 | FALSE | 2 | mal: 676, fem: 662 |
| smoker | 0 | 1 | FALSE | 2 | no: 1064, yes: 274 |
| region | 0 | 1 | FALSE | 4 | sou: 364, nor: 325, sou: 325, nor: 324 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 39.21 | 14.05 | 18.00 | 27.00 | 39.00 | 51.00 | 64.00 |
| bmi | 0 | 1 | 30.66 | 6.10 | 15.96 | 26.30 | 30.40 | 34.69 | 53.13 |
| children | 0 | 1 | 1.09 | 1.21 | 0.00 | 0.00 | 1.00 | 2.00 | 5.00 |
| charges | 0 | 1 | 13270.42 | 12110.01 | 1121.87 | 4740.29 | 9382.03 | 16639.91 | 63770.43 |
There are no missing values in any of the variables. On average the insured have about one child, although many have none, and the typical BMI is around 30, so a large share of the people in the data are obese (a BMI >= 30 is classified as obese).
Let us observe the empirical distribution of the medical charges incurred by the insured.
As expected for variables that measure monetary amounts or income, the distribution is skewed to the right. To normalize the data and stabilize the variance, we will apply a logarithmic transformation.
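As a quick check, the raw and log-transformed charges can be compared with histograms; a minimal ggplot2 sketch (not the original figure):

```r
library(ggplot2)

# Right-skewed distribution of the raw charges
ggplot(insurance, aes(x = charges)) +
  geom_histogram(bins = 40) +
  labs(title = "Medical charges", x = "charges")

# Roughly symmetric distribution after the log transformation
ggplot(insurance, aes(x = log(charges))) +
  geom_histogram(bins = 40) +
  labs(title = "Medical charges (log scale)", x = "log(charges)")
```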
Next, the relationship between the categorical variables and the response variable is explored. The goal is to find shifts in the median, which are the patterns our model should capture.
No relevant changes are observed due to the sex of the insured.
It seems that if the insured person smokes, higher medical costs can be expected, which is not surprising given the effect of smoking on health.
Will there be any relationship between the sex of the insured person and whether or not they are an active smoker?
No interaction is observed in these variables in relation to the medical cost of the insured.
No relevant changes are observed in the response variable with respect to the region of residence of the insured.
It seems that most of the insured have 0 or 1 children on their policies, but this does not appear to be related to the medical cost that each insured incurs.
Next, the relationship between the numerical variables and the response variable is explored.
A positive linear relationship is observed between the age of the insured and the medical cost, however, from previous graphs it was found that the status of smoker greatly impacts the response variable.
Will there be any relationship between this and the age of the insured?
There is an interaction between these variables that demonstrates differences in the mean medical cost. It would be a good idea to use this interaction in the model.
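A sketch of one way to look at that interaction, colouring the age-versus-charges scatterplot by smoking status (hypothetical plot code, not the original figure):

```r
library(ggplot2)

# log(charges) versus age, separated by smoking status
ggplot(insurance, aes(x = age, y = log(charges), colour = smoker)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "log(charges)")
```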
Like the previous step, it seems that here we have another obvious interaction between these variables with respect to medical cost.
Fitting supervised learning models on all the available data is usually not a good idea because it often leads to overfitting. It is better to partition the data into two sets, one to train the models and another to test them; this allows us to estimate the performance of the models on new data.
library(tidymodels)

# Split 80% / 20%
set.seed(123)
split <- initial_split(data = insurance, prop = 0.80, strata = charges)
# Training set
train_set <- training(split)
# Testing set
test_set <- testing(split)
Apart from partitioning the data into training and test sets, the training set will itself be split into 5 resampling folds; this provides a variety of partitions on which to train the model and measure the results of the hyperparameter tuning process.
set.seed(123)
folds <- vfold_cv(train_set, strata = charges, v = 5)
Based on what was observed in the exploratory analysis, a recipe will be created with the pre-processing using feature engineering.
The steps that will be included in the recipe for data pre-processing are explained below:
recipe <-
  recipe(charges ~ ., data = train_set) %>%
  # Log-transform the outcome to reduce the right skew
  step_log(all_outcomes()) %>%
  # Flag obese insured (BMI of 30 or more)
  step_mutate(is_obese =
                if_else(bmi >= 30, "Yes", "No") %>%
                factor(levels = c("Yes", "No"))) %>%
  # Center and scale the numeric predictors
  step_normalize(all_numeric_predictors()) %>%
  # Dummy-encode the categorical predictors
  step_dummy(all_nominal_predictors()) %>%
  # Interactions found in the exploratory analysis
  step_interact(terms = ~ smoker_yes:age) %>%
  step_interact(terms = ~ smoker_yes:bmi)
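To confirm the recipe behaves as intended, it can be prepped on the training set and the processed columns inspected; a minimal sketch:

```r
# Prep the recipe on the training data and look at the resulting columns
recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```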
We now have the recipe with the data pre-processing; all that remains is to set up the workflow for our Random Forest model and the grid with the combinations of hyperparameters to tune.
The three main hyperparameters to tune are mtry (the number of predictors sampled at each split), trees (the number of trees in the forest), and min_n (the minimum number of data points in a node required for it to be split further).
# Random forest spec
rf_spec <- rand_forest(
mtry = tune(), trees = tune(), min_n = tune()
) %>%
set_engine(
"ranger", num.threads = 7, importance = "impurity"
) %>%
set_mode("regression")
# Random forest workflow
rf_Wf <- workflow() %>%
add_recipe(recipe) %>%
add_model(rf_spec)
A space-filling design (a Latin hypercube) will be used to build the grid of candidate hyperparameter values. For more information, consult the documentation of the dials package.
# Space-filling designs grid
rf_grid <-
grid_latin_hypercube(
min_n(),
mtry(range = c(4, 9)),
trees(),
size = 80)
Now let’s visualize the grid.
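A sketch of one way to plot the 80 candidate points (not necessarily the original figure):

```r
library(ggplot2)

# Candidate hyperparameter combinations produced by the Latin hypercube design
ggplot(rf_grid, aes(x = mtry, y = min_n, size = trees)) +
  geom_point(alpha = 0.6) +
  labs(title = "Space-filling grid of candidate hyperparameters")
```

With the grid in place, every combination can be evaluated across the five folds.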
# Let's use seven cores for this process
doParallel::registerDoParallel(cores = 7)
set.seed(123)
tune_res <-
rf_Wf %>%
tune_grid(
resamples = folds, grid = rf_grid,
metrics = metric_set(rmse, mae, rsq)
)
Let’s visualize the tuning results.
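The resampled results can be pulled with collect_metrics() and summarised with show_best(); a minimal sketch of how tables like the ones below could be produced:

```r
# All resampled metrics for every candidate combination
collect_metrics(tune_res)

# Top candidates for each metric of interest
show_best(tune_res, metric = "rmse", n = 5)
show_best(tune_res, metric = "rsq", n = 5)
```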
The five best models for minimizing the root mean square error:
| .config | .metric | mtry | trees | min_n | mean |
|---|---|---|---|---|---|
| Preprocessor1_Model14 | rmse | 6 | 525 | 35 | 0.3719130 |
| Preprocessor1_Model21 | rmse | 6 | 1621 | 31 | 0.3719945 |
| Preprocessor1_Model47 | rmse | 5 | 613 | 23 | 0.3720179 |
| Preprocessor1_Model04 | rmse | 7 | 1998 | 37 | 0.3720898 |
| Preprocessor1_Model42 | rmse | 5 | 790 | 36 | 0.3723533 |
The five best models for maximizing the coefficient of determination:
| .config | .metric | mtry | trees | min_n | mean |
|---|---|---|---|---|---|
| Preprocessor1_Model57 | rsq | 5 | 983 | 36 | 0.8362315 |
| Preprocessor1_Model14 | rsq | 6 | 525 | 35 | 0.8360723 |
| Preprocessor1_Model42 | rsq | 5 | 790 | 36 | 0.8360524 |
| Preprocessor1_Model47 | rsq | 5 | 613 | 23 | 0.8359879 |
| Preprocessor1_Model09 | rsq | 5 | 1356 | 36 | 0.8359692 |
Under both metrics, very similar hyperparameter combinations rank at the top, and several candidates appear in both lists.
# finalize model set up
final_rf <- rf_Wf %>%
finalize_workflow(
show_best(x = tune_res, metric = "rmse", n = 1)
)
Now let's fit the final workflow and evaluate its performance on the test set with last_fit().
final_rs <-
final_rf %>%
last_fit(
split, metrics = metric_set(rmse, mae, rsq)
)
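The test-set metrics can then be pulled from the last_fit object; a minimal sketch:

```r
# RMSE, MAE, and R squared evaluated on the held-out test set
collect_metrics(final_rs)
```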
The easiest way to visually assess the agreement between observed and predicted values is with a scatterplot, so let's do it.
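A sketch of such a plot, using the predictions stored by last_fit() (the plotting details are assumptions, not the original figure):

```r
library(ggplot2)

# Observed versus predicted log(charges) on the test set
collect_predictions(final_rs) %>%
  ggplot(aes(x = charges, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(lty = 2, colour = "grey50") +
  coord_obs_pred() +
  labs(x = "Observed log(charges)", y = "Predicted log(charges)")
```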
It looks good but it seems that for some people the model tends to underestimate the values.
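The conclusion below draws on a variable-importance plot. A sketch of how such a plot could be produced with the vip package (assuming it is installed), using the impurity importance requested when the ranger engine was set:

```r
library(vip)

# Impurity-based variable importance from the final fitted random forest
final_rs %>%
  extract_workflow() %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)
```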
From the variable importance plot it can be concluded that the age and smoking status of the insured are the main predictors of medical cost. In addition, the interactions between the variables are relevant and, as observed in the exploratory analysis, the sex of the insured and their region of residence are not.