GETTING STARTED WITH TIDYMODELS
Modelling Medical Insurance Costs with Linear Regression and tidymodels
Background
The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles (https://www.tidymodels.org/).

The idea behind tidymodels is to address the inconsistency problem prevalent in R (Kuhn and Wickham 2020). Because different machine learning engines were developed by different people, the terminology and steps tend to differ across engines. The caret package, which has been the bedrock of machine learning in R, did not adequately address these inconsistencies. The developer of caret, Max Kuhn, has been at the forefront of developing tidymodels alongside other data scientists at RStudio. The tidymodels framework standardizes the fitting of machine learning models by providing a uniform syntax. You can read more about tidymodels on the tidymodels website.
To install tidymodels, use the following code.
install.packages("tidymodels")
The tidymodels framework consists of several key packages for sampling, feature engineering, model fitting, hyper-parameter tuning, and the evaluation of machine learning models (a sketch for loading them all at once follows the list below).

- Sampling: The rsample package allows for the creation of training and testing data sets.
- Feature engineering: The recipes package allows for the creation of new variables by manipulating existing variables.
- Model fitting: The parsnip package provides a consistent framework for fitting models.
- Model tuning: The dials and tune packages provide functions for tuning hyper-parameters.
- Model evaluation: The yardstick package provides functions for model evaluation.
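All of these packages are loaded in one step through the tidymodels meta-package.

## Loading the tidymodels meta-package attaches the core packages
## listed above (rsample, recipes, parsnip, tune, dials, yardstick),
## along with helpers such as broom, dplyr, and ggplot2.
library(tidymodels)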
Objective and Caveats
In this exercise, I use a simple example of medical insurance to illustrate beginner concepts in tidymodels. This write-up is NOT for intermediate or advanced users of tidymodels. Again, the exercise aims to illustrate basic steps in using tidymodels, NOT model fitting and performance. I assume basic knowledge of the R programming language and statistical analysis environment (R Core Team 2021).
The Data
I use a sample of health insurance cost data from Kaggle and linear regression to demonstrate tidymodels. I have loaded a copy of the data into my GitHub account; the direct link is https://raw.githubusercontent.com/Karuitha/tidymodels_exe/master/insurance.csv. The data contains variables for predicting the cost of health insurance: age, sex, body mass index (BMI), children, smoker, and region. I seek to predict health insurance charges using these independent variables. First, I read the data into R.
## Load the tidyverse for data import and wrangling
library(tidyverse)
## Reading in the data and converting character variables to factors
insurance_data <- read_csv("https://raw.githubusercontent.com/Karuitha/tidymodels_exe/master/insurance.csv") %>%
  mutate(across(where(is_character), as_factor))
## The first six rows of the data
head(insurance_data) %>%
## Convert variable names to upper case
set_names(names(.) %>%
str_to_upper()) %>%
## Make a nice table
knitr::kable(caption = "First 6 Rows of the Insurance Data", booktabs = TRUE) %>%
kableExtra::kable_styling(full_width = TRUE, bootstrap_options = "striped")

| AGE | SEX | BMI | CHILDREN | SMOKER | REGION | CHARGES |
|---|---|---|---|---|---|---|
| 19 | female | 27.900 | 0 | yes | southwest | 16884.924 |
| 18 | male | 33.770 | 1 | no | southeast | 1725.552 |
| 28 | male | 33.000 | 3 | no | southeast | 4449.462 |
| 33 | male | 22.705 | 0 | no | northwest | 21984.471 |
| 32 | male | 28.880 | 0 | no | northwest | 3866.855 |
| 31 | female | 25.740 | 0 | no | southeast | 3756.622 |
The data has 1338 rows and 7 columns. I start by splitting the data into a training set and a testing set and then proceed to model the data and evaluate the model performance on the test data.
The Training Set and Testing Set
I use the rsample package to split the data into training and testing sets. The rsample initial_split function takes the following inputs: the data, the proportion of the data that will go into the training set (prop), and the strata used to divide the data (strata). I split the data into two parts, the training set and the test set, with proportions of 75% and 25%, respectively.
## Create a split object with 75% of the data in training set
insurance_split <- initial_split(insurance_data,
prop = 0.75, strata = charges)
## generate the training set
insurance_training <- insurance_split %>%
training()
## generate the testing set
insurance_testing <- insurance_split %>%
testing()

The training set has 1002 observations, while the testing set has 336 observations.
Exploratory Data Analysis
I summarise and visualize the training data in this section.
## The data
insurance_training %>%
## Use ggally to run pairwise graphs
GGally::ggpairs(mapping = aes(col = sex)) +
## Choose colors
scale_color_manual(values = c("indianred4", "grey")) +
## Choose colors
scale_fill_manual(values = c("indianred4", "grey")) +
## Add title
labs(title = "Pairwise Visualisation of the Health Insurance Data")

The summary of the numeric variables is as follows.
## The data
insurance_training %>%
## Select numeric data
select(where(is.numeric)) %>%
## Get summary statistics
skimr::skim_without_charts() %>%
## Select important variables
select(-skim_type, -complete_rate) %>%
## Do a nice table
knitr::kable(caption = "Summary of Numeric Variables in the Training Set")| skim_variable | n_missing | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 |
|---|---|---|---|---|---|---|---|---|
| age | 0 | 39.28942 | 14.125316 | 18.000 | 27.000 | 39.000 | 51.00000 | 64.00 |
| bmi | 0 | 30.52600 | 6.134900 | 15.960 | 26.185 | 30.115 | 34.47125 | 53.13 |
| children | 0 | 1.07485 | 1.193299 | 0.000 | 0.000 | 1.000 | 2.00000 | 5.00 |
| charges | 0 | 13172.61074 | 11918.779405 | 1121.874 | 4740.287 | 9382.033 | 16639.91251 | 63770.43 |
The table below shows the summary for the factor variables.
## The data
table(insurance_training$sex) %>%
## Do a nice table
knitr::kable(col.names = c("Sex", "Freq"), caption = "Summary of the Sex Variable")| Sex | Freq |
|---|---|
| female | 494 |
| male | 508 |
## The data
table(insurance_training$smoker) %>%
## Do a nice table
knitr::kable(col.names = c("Smoker", "Freq"), caption = "Summary of the Smoker Variable")| Smoker | Freq |
|---|---|
| no | 796 |
| yes | 206 |
## The data
table(insurance_training$region) %>%
## Do a nice table
knitr::kable(col.names = c("Region", "Freq"), caption = "Summary of the Region Variable")| Region | Freq |
|---|---|
| northeast | 246 |
| northwest | 250 |
| southeast | 270 |
| southwest | 236 |
Running the Linear Regression Model
The parsnip package provides a standardized interface for running numerous machine learning models with a unified syntax, unlike caret, where one has to memorize different terminologies used in separate packages. I start by defining the linear regression model (Aiken et al. 2012).
## Specify the model to run
linear_regression_model <- linear_reg() %>%
## Set the engine to be used in running the regression
set_engine("lm") %>%
## Specify the kind of model: regression or classification
set_mode("regression")Next, I run the linear regression using the fit function from parsnip, specifying the variables as we do when running a regression using the lm function.
## Fit the model on training data
linear_regression_insurance <- linear_regression_model %>%
fit(charges ~ age + sex + bmi + children + smoker + region,
data = insurance_training)
## Create a tibble of the regression outputs
broom::tidy(linear_regression_insurance) %>%
## make a nice table
knitr::kable(caption = "Output of the Linear Regression Model", booktabs = TRUE) %>%
kableExtra::kable_styling(full_width = TRUE, bootstrap_options = "striped")

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -11926.0737 | 1116.39685 | -10.6826473 | 0.0000000 |
| age | 249.9296 | 13.58589 | 18.3962684 | 0.0000000 |
| sexmale | -474.3536 | 381.56714 | -1.2431719 | 0.2140979 |
| bmi | 354.5126 | 32.65751 | 10.8554683 | 0.0000000 |
| children | 430.5708 | 159.69883 | 2.6961424 | 0.0071334 |
| smokeryes | 23385.8384 | 472.78416 | 49.4640906 | 0.0000000 |
| regionnorthwest | -171.1264 | 541.37952 | -0.3160931 | 0.7519982 |
| regionsoutheast | -1091.4418 | 548.19093 | -1.9909884 | 0.0467558 |
| regionsouthwest | -1002.5757 | 550.96107 | -1.8196852 | 0.0691080 |
The model shows that all the variables except sex are significant drivers of health insurance charges. Age, BMI, number of children, and smoking are all positively related to health insurance charges. For region, compared with the North East, individuals from all the other regions have, on average, lower health insurance charges, although the coefficients for the North West and South West regions are not significant at the 5% level.
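As a side note on parsnip's unified syntax, the same specification could be re-fitted with a different engine simply by changing set_engine(). The sketch below is illustrative only; it assumes the glmnet package is installed and is not evaluated further here.

## Hypothetical re-fit of the same formula with the glmnet engine
## (regularised regression); this engine requires a penalty value.
linear_reg(penalty = 0.1) %>%
  set_engine("glmnet") %>%
  set_mode("regression") %>%
  fit(charges ~ age + sex + bmi + children + smoker + region,
      data = insurance_training)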
Model Predictions
I use the fitted model to predict the charges for patients in the test set. The predict function serves this purpose. It returns a tibble with a single column, .pred, whose rows match the order and number of the rows in the test data.
## Use the model to make predictions on the test data
insurance_predictions <- linear_regression_insurance %>%
predict(new_data = insurance_testing)
## View the first six rows of the predictions
head(insurance_predictions) %>%
## Make a nice table
knitr::kable(caption = "First 6 Rows of the Predictions")| .pred |
|---|
| 3409.325 |
| 10764.718 |
| 8283.220 |
| 12059.181 |
| 31577.684 |
| 15833.928 |
I attach these predictions to the testing set.
## Attach the predictions to the testing data
insurance_testing_with_predictions <- insurance_testing %>%
bind_cols(insurance_predictions)
## View the first six rows of data with predictions
head(insurance_testing_with_predictions) %>%
## Change the case of variable names
set_names(names(.) %>%
str_to_sentence()) %>%
## make a nice table
knitr::kable(caption = "First 6 Rows of the Testing Set with Predictions")| Age | Sex | Bmi | Children | Smoker | Region | Charges | .Pred |
|---|---|---|---|---|---|---|---|
| 18 | male | 33.770 | 1 | no | southeast | 1725.552 | 3409.325 |
| 46 | female | 33.440 | 1 | no | southeast | 8240.590 | 10764.718 |
| 37 | male | 29.830 | 2 | no | northeast | 6406.411 | 8283.220 |
| 60 | female | 25.840 | 0 | no | northwest | 28923.137 | 12059.181 |
| 27 | male | 42.130 | 0 | yes | southeast | 39611.758 | 31577.684 |
| 60 | female | 36.005 | 0 | no | northeast | 13228.847 | 15833.928 |
Model Evaluation
I first plot the predicted charges against the actual charges to visualize how well the model fits the data.
## Plot predicted versus actual charges
insurance_testing_with_predictions %>%
## Call ggplot for plotting and give X and Y axis
ggplot(mapping = aes(x = charges, y = .pred, col = sex)) +
## Geom_point to do scatter plot
geom_point(shape = 1, size = 2, alpha = 0.5, stroke = 3) +
## geom_abline to add a 45-degree reference line (perfect prediction)
geom_abline(col = "purple") +
## Harmonize X and Y axis scales
coord_obs_pred() +
## Put a nice theme
ggthemes::theme_clean() +
## Add labels and titles
labs(x = "Actual Charges", y = "Predicted Charges",
title = "Actual versus Predicted Charges") +
scale_color_manual(values = c("red", "blue")) +
## Remove legend title
theme(legend.title = element_blank())

The yardstick package provides a set of functions for evaluating models. For linear regression models, the root mean squared error (RMSE) and the coefficient of determination (\(R^{2}\)) are the prominent evaluation metrics. I compute these two metrics in turn.
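For reference, the RMSE measures the typical size of the prediction error in the units of the outcome,

\[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}},\]

where \(y_i\) is the observed charge and \(\hat{y}_i\) the predicted charge, while the \(R^{2}\) reported by yardstick's rsq function is the squared correlation between observed and predicted values.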
## Compute the root mean squared error (RMSE)
insurance_testing_with_predictions %>%
rmse(truth = charges, estimate = .pred) %>%
## Make a nice table
knitr::kable(caption = "The RMSE on the Test Set")| .metric | .estimator | .estimate |
|---|---|---|
| rmse | standard | 6224.63 |
Likewise, the coefficient of determination is as follows.
## Get the coefficient of determination
insurance_testing_with_predictions %>%
rsq(truth = charges, estimate = .pred) %>%
## Make a nice table
knitr::kable(caption = "The RMSE on the Test Set")| .metric | .estimator | .estimate |
|---|---|---|
| rsq | standard | 0.7628016 |
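As an aside, yardstick can bundle several metrics into a single function with metric_set(). A minimal sketch, not required for the workflow above:

## Hypothetical bundling of RMSE and R-squared into one metric function
insurance_metrics <- metric_set(rmse, rsq)
## Apply both metrics to the test-set predictions in one call
insurance_testing_with_predictions %>%
  insurance_metrics(truth = charges, estimate = .pred)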
Further Steps
The tidymodels ecosystem provides an even easier route to modelling using the last_fit function. The last_fit function takes in the model specification (in our case, linear_regression_model), the model formula (that is, the dependent and independent variables), and the split object. The function then runs the entire workflow as follows.
- It fits the model on the training portion of the split.
- It makes predictions on the testing portion.
- It evaluates the model on those test-set predictions.
After this, one can run the collect_predictions and collect_metrics functions to get the model predictions on the test set and the evaluation metrics (RMSE and \(R^{2}\)), respectively.
## Specify the model
insurance_model_final_fit <- linear_regression_model %>%
## Specify the variables and the split object
last_fit(charges ~ age + sex + bmi + children + smoker + region,
split = insurance_split)

Here I get the predictions. Note that the resulting tibble contains the predictions alongside the actual values of the dependent variable, charges. I present only the first ten observations.
## Get the predictions on the test set
insurance_model_final_fit %>%
collect_predictions() %>%
## Get top 10 observations of the predictions
head(10) %>%
## Make a nice table
knitr::kable(caption = "Predictions on the Test Set")| id | .pred | .row | charges | .config |
|---|---|---|---|---|
| train/test split | 3409.325 | 2 | 1725.552 | Preprocessor1_Model1 |
| train/test split | 10764.718 | 7 | 8240.590 | Preprocessor1_Model1 |
| train/test split | 8283.220 | 9 | 6406.411 | Preprocessor1_Model1 |
| train/test split | 12059.181 | 10 | 28923.137 | Preprocessor1_Model1 |
| train/test split | 31577.684 | 15 | 39611.758 | Preprocessor1_Model1 |
| train/test split | 15833.928 | 21 | 13228.847 | Preprocessor1_Model1 |
| train/test split | 12847.132 | 26 | 14001.134 | Preprocessor1_Model1 |
| train/test split | 4111.927 | 33 | 4687.797 | Preprocessor1_Model1 |
| train/test split | 7227.232 | 48 | 3556.922 | Preprocessor1_Model1 |
| train/test split | 5202.170 | 51 | 2211.131 | Preprocessor1_Model1 |
Finally, I collect the metrics for evaluating the model performance using the collect_metrics function.
## Get the evaluation metrics on the test set
insurance_model_final_fit %>%
collect_metrics() %>%
## Make a nice table
knitr::kable(caption = "Model Evaluation Metrics on the Test Set")| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 6224.6295258 | Preprocessor1_Model1 |
| rsq | standard | 0.7628016 | Preprocessor1_Model1 |
Conclusion
In this exercise, I have used medical insurance data to illustrate basic concepts of tidymodels. It is notable how tidymodels introduces consistent syntax, which simplifies the modelling process. The model in this example could be improved, for instance, by taking the log of the skewed dependent variable, charges. However, this exercise aimed not to optimize the model but to illustrate the use of tidymodels.
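For readers who want to pursue the log-transformation idea, a minimal sketch follows; it simply transforms charges before splitting and is not fitted or evaluated here.

## Hypothetical log transformation of the skewed outcome before splitting
insurance_data_logged <- insurance_data %>%
  mutate(charges = log(charges))
## The same split-fit-evaluate workflow above would then be repeated,
## remembering to exponentiate predictions back to the original scale
insurance_split_logged <- initial_split(insurance_data_logged,
  prop = 0.75, strata = charges)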