GETTING STARTED WITH TIDYMODELS
Modelling Medical Insurance Costs with Linear Regression and tidymodels
Background
The `tidymodels` framework is a collection of packages for modeling and machine learning using tidyverse principles (https://www.tidymodels.org/). The idea behind `tidymodels` is to address the inconsistency problem prevalent in R (Kuhn and Wickham 2020). Given that different machine learning engines were developed by different people, the terminologies and steps tend to differ across these engines. The `caret` package, which has been the bedrock of machine learning in R, did not adequately address these inconsistency issues. The developer of `caret`, Max Kuhn, has been at the forefront of developing `tidymodels` alongside other data scientists at RStudio. The `tidymodels` framework standardizes the fitting of machine learning models through a uniform syntax. You can read more on `tidymodels` here, here, and here.
To install `tidymodels`, use the following code.
install.packages("tidymodels")
The `tidymodels` framework consists of several key packages for sampling, feature engineering, model fitting, hyper-parameter tuning, and evaluation of machine learning models.
- Sampling: The `rsample` package allows for the creation of training and testing data sets.
- Feature engineering: The `recipes` package allows for the creation of new variables by manipulating the existing variables.
- Model fitting: The `parsnip` package provides a consistent framework for fitting models.
- Model tuning: The `dials` and `tune` packages provide functions for tuning hyper-parameters.
- Model evaluation: The `yardstick` package allows for model evaluation.
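The setup chunk for this post is not shown, so the following is only a minimal sketch of an assumed setup: loading `tidymodels` attaches the core packages listed above, while `tidyverse` supplies the data-manipulation and plotting functions used later.
## Setup (sketch of an assumed setup chunk, not shown in the post)
library(tidymodels) ## attaches rsample, recipes, parsnip, tune, dials, yardstick, among others
library(tidyverse)  ## attaches readr, dplyr, forcats, stringr, ggplot2, among others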
Objective and Caveats
In this exercise, I use a simple example of medical insurance to illustrate beginner concepts in `tidymodels`. This write-up is NOT for intermediate or advanced users of `tidymodels`. Again, the exercise aims to illustrate the basic steps in using `tidymodels`, NOT model fitting and performance. I assume basic knowledge of the R programming language and statistical analysis environment (R Core Team 2021).
NB: Click on Code to view the R code associated with any of the steps.
The Data
I use a sample of health insurance cost data from Kaggle and linear regression to demonstrate `tidymodels`. The data is available on this link. However, I have loaded the data into my GitHub account. The direct link to the data is https://raw.githubusercontent.com/Karuitha/tidymodels_exe/master/insurance.csv. The data contain variables to predict the cost of health insurance. These variables include age, sex, body mass index (BMI), children, smoker, and region. I seek to predict health insurance charges using the independent variables provided. First, I read the data into R.
## Reading in the data
insurance_data <- read_csv("https://raw.githubusercontent.com/Karuitha/tidymodels_exe/master/insurance.csv") %>%
  mutate(across(where(is_character), as_factor))
## The first six rows of the data
head(insurance_data) %>%
  ## Convert variable names to upper case
  set_names(names(.) %>% str_to_upper()) %>%
  ## Make a nice table
  knitr::kable(caption = "First 6 Rows of the Insurance Data", booktabs = TRUE) %>%
  kableExtra::kable_styling(full_width = TRUE, bootstrap_options = "striped")
AGE | SEX | BMI | CHILDREN | SMOKER | REGION | CHARGES |
---|---|---|---|---|---|---|
19 | female | 27.900 | 0 | yes | southwest | 16884.924 |
18 | male | 33.770 | 1 | no | southeast | 1725.552 |
28 | male | 33.000 | 3 | no | southeast | 4449.462 |
33 | male | 22.705 | 0 | no | northwest | 21984.471 |
32 | male | 28.880 | 0 | no | northwest | 3866.855 |
31 | female | 25.740 | 0 | no | southeast | 3756.622 |
The data has 1338 rows and 7 columns. I start by splitting the data into a training set and a testing set and then proceed to model the data and evaluate the model performance on the test data.
The Training Set and Testing Set
I use the `rsample` package to split the data into training and testing sets. The `rsample` function `initial_split()` takes the following inputs: the data, the proportion of the data that will go into the training set (`prop`), and the strata used to divide the data (`strata`). I split the data into two parts, the training set and the test set, with proportions of 75% and 25%, respectively.
## Create a split object with 75% of the data in the training set
insurance_split <- initial_split(insurance_data,
                                 prop = 0.75, strata = charges)

## Generate the training set
insurance_training <- insurance_split %>%
  training()

## Generate the testing set
insurance_testing <- insurance_split %>%
  testing()
The training set has 1002 observations while the testing set has 336 observations.
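As a quick check, the set sizes can be confirmed with `nrow()`; this is a minimal sketch that assumes the `insurance_training` and `insurance_testing` objects created above.
## Confirm the sizes of the training and testing sets (sketch)
nrow(insurance_training) ## 1002 observations
nrow(insurance_testing)  ## 336 observations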
Exploratory Data Analysis
I summarise and visualize the training data in this section.
## The data
insurance_training %>%
  ## Use GGally to run pairwise graphs
  GGally::ggpairs(mapping = aes(col = sex)) +
  ## Choose colors
  scale_color_manual(values = c("indianred4", "grey")) +
  ## Choose fill colors
  scale_fill_manual(values = c("indianred4", "grey")) +
  ## Add title
  labs(title = "Pairwise Visualisation of the Health Insurance Data")
The summary of the numeric variables is as follows.
## The data
insurance_training %>%
  ## Select numeric data
  select(where(is.numeric)) %>%
  ## Get summary statistics
  skimr::skim_without_charts() %>%
  ## Select important variables
  select(-skim_type, -complete_rate) %>%
  ## Do a nice table
  knitr::kable(caption = "Summary of Numeric Variables in the Training Set")
skim_variable | n_missing | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 |
---|---|---|---|---|---|---|---|---|
age | 0 | 39.28942 | 14.125316 | 18.000 | 27.000 | 39.000 | 51.00000 | 64.00 |
bmi | 0 | 30.52600 | 6.134900 | 15.960 | 26.185 | 30.115 | 34.47125 | 53.13 |
children | 0 | 1.07485 | 1.193299 | 0.000 | 0.000 | 1.000 | 2.00000 | 5.00 |
charges | 0 | 13172.61074 | 11918.779405 | 1121.874 | 4740.287 | 9382.033 | 16639.91251 | 63770.43 |
The table below shows the summary for the factor variables.
## The data
table(insurance_training$sex) %>%
  ## Do a nice table
  knitr::kable(col.names = c("Sex", "Freq"), caption = "Summary of the Sex Variable")
Sex | Freq |
---|---|
female | 494 |
male | 508 |
## The data
table(insurance_training$smoker) %>%
  ## Do a nice table
  knitr::kable(col.names = c("Smoker", "Freq"), caption = "Summary of the Smoker Variable")
Smoker | Freq |
---|---|
no | 796 |
yes | 206 |
## The data
table(insurance_training$region) %>%
  ## Do a nice table
  knitr::kable(col.names = c("Region", "Freq"), caption = "Summary of the Region Variable")
Region | Freq |
---|---|
northeast | 246 |
northwest | 250 |
southeast | 270 |
southwest | 236 |
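As an aside, the three frequency tables above can be produced in a single pass. The snippet below is only a sketch of a hypothetical alternative; it assumes `purrr` is available (it is attached with the `tidyverse`).
## Sketch: tabulate every factor variable in the training set at once
insurance_training %>%
  select(where(is.factor)) %>%
  purrr::map(table)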
Running the Linear Regression Model
The `parsnip` package provides a standardized interface for running numerous machine learning models with a unified syntax, unlike `caret`, where one has to memorize different terminologies used in separate packages. I start by defining the linear regression model (Aiken et al. 2012).
## Specify the model to run
linear_regression_model <- linear_reg() %>%
  ## Set the engine to be used in running the regression
  set_engine("lm") %>%
  ## Specify the kind of model: regression or classification
  set_mode("regression")
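To see what the unified syntax buys us, note that the same specification pattern would define, say, a penalised linear model simply by changing the engine. The snippet below is a sketch only; the `glmnet` engine and the penalty value are illustrative assumptions and are not used elsewhere in this exercise.
## Sketch only: the same specification pattern with a different engine
penalised_linear_model <- linear_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("regression")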
Next, I run the linear regression using the `fit()` function from `parsnip`, specifying the variables as we do when running a regression using the `lm()` function.
## Fit the model on the training data
linear_regression_insurance <- linear_regression_model %>%
  fit(charges ~ age + sex + bmi + children + smoker + region,
      data = insurance_training)

## Create a tibble of the regression outputs
broom::tidy(linear_regression_insurance) %>%
  ## Make a nice table
  knitr::kable(caption = "Output of the Linear Regression Model", booktabs = TRUE) %>%
  kableExtra::kable_styling(full_width = TRUE, bootstrap_options = "striped")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -11926.0737 | 1116.39685 | -10.6826473 | 0.0000000 |
age | 249.9296 | 13.58589 | 18.3962684 | 0.0000000 |
sexmale | -474.3536 | 381.56714 | -1.2431719 | 0.2140979 |
bmi | 354.5126 | 32.65751 | 10.8554683 | 0.0000000 |
children | 430.5708 | 159.69883 | 2.6961424 | 0.0071334 |
smokeryes | 23385.8384 | 472.78416 | 49.4640906 | 0.0000000 |
regionnorthwest | -171.1264 | 541.37952 | -0.3160931 | 0.7519982 |
regionsoutheast | -1091.4418 | 548.19093 | -1.9909884 | 0.0467558 |
regionsouthwest | -1002.5757 | 550.96107 | -1.8196852 | 0.0691080 |
The model shows that age, BMI, number of children, and smoking are significant drivers of health insurance charges, and all are positively related to charges; the coefficient for sex is not significant. For the region, compared to the North East (the reference category), individuals from all the other areas have, on average, lower health insurance charges. However, only the coefficient for the South East region is significant at the 5% level.
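For model-level diagnostics (overall \(R^{2}\), the F-statistic, and so on), one option is to pull out the underlying `lm` object. This is a minimal sketch that assumes a recent version of `parsnip` providing `extract_fit_engine()`.
## Sketch: inspect the underlying lm fit for model-level statistics
linear_regression_insurance %>%
  extract_fit_engine() %>%
  summary()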
Model Predictions
I use the model to predict the charges for patients in the test set. The `predict()` function serves this purpose. It generates a tibble with a single new variable, `.pred`, whose rows match the number and order of rows in the test data.
## Use the model to make predictions on the test data
insurance_predictions <- linear_regression_insurance %>%
  predict(new_data = insurance_testing)

## View the first six rows of the predictions
head(insurance_predictions) %>%
  ## Make a nice table
  knitr::kable(caption = "First 6 Rows of the Predictions")
.pred |
---|
3409.325 |
10764.718 |
8283.220 |
12059.181 |
31577.684 |
15833.928 |
I attach these predictions to the testing data.
## Attach the predictions to the testing data
insurance_testing_with_predictions <- insurance_testing %>%
  bind_cols(insurance_predictions)

## View the first six rows of the data with predictions
head(insurance_testing_with_predictions) %>%
  ## Change the case of variable names
  set_names(names(.) %>% str_to_sentence()) %>%
  ## Make a nice table
  knitr::kable(caption = "First 6 Rows of the Testing Set with Predictions")
Age | Sex | Bmi | Children | Smoker | Region | Charges | .Pred |
---|---|---|---|---|---|---|---|
18 | male | 33.770 | 1 | no | southeast | 1725.552 | 3409.325 |
46 | female | 33.440 | 1 | no | southeast | 8240.590 | 10764.718 |
37 | male | 29.830 | 2 | no | northeast | 6406.411 | 8283.220 |
60 | female | 25.840 | 0 | no | northwest | 28923.137 | 12059.181 |
27 | male | 42.130 | 0 | yes | southeast | 39611.758 | 31577.684 |
60 | female | 36.005 | 0 | no | northeast | 13228.847 | 15833.928 |
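As an aside, the predict-then-bind step can typically be collapsed into one call. The line below is a sketch that assumes a version of `parsnip` providing an `augment()` method for fitted models.
## Sketch: append predictions to the test set in a single step
augment(linear_regression_insurance, new_data = insurance_testing)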
Model Evaluation
I first plot the predicted charges against the actual charges to visualize how well the model fits the data.
## Plot predicted versus actual charges
insurance_testing_with_predictions %>%
  ## Call ggplot for plotting and give the X and Y axes
  ggplot(mapping = aes(x = charges, y = .pred, col = sex)) +
  ## geom_point to do a scatter plot
  geom_point(shape = 1, size = 2, alpha = 0.5, stroke = 3) +
  ## geom_abline to add a 45-degree reference line (perfect predictions)
  geom_abline(col = "purple") +
  ## Harmonize the X and Y axis scales
  coord_obs_pred() +
  ## Put a nice theme
  ggthemes::theme_clean() +
  ## Add labels and a title
  labs(x = "Actual Charges", y = "Predicted Charges",
       title = "Actual versus Predicted Charges") +
  scale_color_manual(values = c("red", "blue")) +
  ## Remove the legend title
  theme(legend.title = element_blank())
The `yardstick` package provides a set of functions for evaluating models. For linear regression models, the root mean squared error (RMSE) and the coefficient of determination (\(R^{2}\)) are the prominent evaluation metrics. I compute these two metrics in turn.
## Compute the root mean squared error (RMSE)
insurance_testing_with_predictions %>%
  rmse(truth = charges, estimate = .pred) %>%
  ## Make a nice table
  knitr::kable(caption = "The RMSE on the Test Set")
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 6224.63 |
Likewise, the coefficient of determination is as follows.
## Get the coefficient of determination
insurance_testing_with_predictions %>%
  rsq(truth = charges, estimate = .pred) %>%
  ## Make a nice table
  knitr::kable(caption = "The R-Squared on the Test Set")
.metric | .estimator | .estimate |
---|---|---|
rsq | standard | 0.7628016 |
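Both metrics can also be computed in a single call using `yardstick`'s `metric_set()`. The following is a minimal sketch that reuses the objects created above.
## Sketch: compute RMSE and R-squared together
insurance_metrics <- metric_set(rmse, rsq)

insurance_testing_with_predictions %>%
  insurance_metrics(truth = charges, estimate = .pred)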
Further Steps
The `tidymodels` ecosystem provides an even easier route to modelling using the `last_fit()` function. The `last_fit()` function takes in the model specification (in our case, `linear_regression_model`), the model formula (that is, the dependent and independent variables), and the split object. The function then runs the entire workflow as follows.
- It extracts the training and testing sets from the split object.
- It fits the model on the training set.
- It makes predictions and evaluates the model on the testing set.
After this, one can run the `collect_predictions()` and `collect_metrics()` functions to get the model predictions on the test set and the evaluation metrics (RMSE and \(R^{2}\)), respectively.
## Specify the model
insurance_model_final_fit <- linear_regression_model %>%
  ## Specify the variables and the split object
  last_fit(charges ~ age + sex + bmi + children + smoker + region,
           split = insurance_split)
Here I get the predictions. Note that the resulting tibble contains both the predicted and the actual values of the dependent variable, charges. I present only the first ten observations.
## Get the predictions on the test set
insurance_model_final_fit %>%
  collect_predictions() %>%
  ## Get the top 10 observations of the predictions
  head(10) %>%
  ## Make a nice table
  knitr::kable(caption = "Predictions on the Test Set")
id | .pred | .row | charges | .config |
---|---|---|---|---|
train/test split | 3409.325 | 2 | 1725.552 | Preprocessor1_Model1 |
train/test split | 10764.718 | 7 | 8240.590 | Preprocessor1_Model1 |
train/test split | 8283.220 | 9 | 6406.411 | Preprocessor1_Model1 |
train/test split | 12059.181 | 10 | 28923.137 | Preprocessor1_Model1 |
train/test split | 31577.684 | 15 | 39611.758 | Preprocessor1_Model1 |
train/test split | 15833.928 | 21 | 13228.847 | Preprocessor1_Model1 |
train/test split | 12847.132 | 26 | 14001.134 | Preprocessor1_Model1 |
train/test split | 4111.927 | 33 | 4687.797 | Preprocessor1_Model1 |
train/test split | 7227.232 | 48 | 3556.922 | Preprocessor1_Model1 |
train/test split | 5202.170 | 51 | 2211.131 | Preprocessor1_Model1 |
Finally, I collect the metrics for evaluating the model performance using the `collect_metrics()` function.
## Get the evaluation metrics on the test set
insurance_model_final_fit %>%
  collect_metrics() %>%
  ## Make a nice table
  knitr::kable(caption = "Model Evaluation Metrics on the Test Set")
.metric | .estimator | .estimate | .config |
---|---|---|---|
rmse | standard | 6224.6295258 | Preprocessor1_Model1 |
rsq | standard | 0.7628016 | Preprocessor1_Model1 |
Conclusion
In this exercise, I have used medical insurance data to illustrate basic concepts of `tidymodels`. It is notable how `tidymodels` introduces consistent syntax, which simplifies the modelling process. The model in this example could be improved, for instance, by taking the log of the skewed dependent variable `charges`; a sketch of one way to do this follows. However, this exercise aimed not to optimize the model but to illustrate the use of `tidymodels`.
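One hypothetical way to build that transformation into a `tidymodels` workflow is with a `recipes` step. The sketch below is illustrative only and was not used in the model above.
## Sketch: a recipe that log-transforms the outcome before modelling
insurance_recipe <- recipe(charges ~ age + sex + bmi + children + smoker + region,
                           data = insurance_training) %>%
  step_log(charges)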