GETTING STARTED WITH TIDYMODELS

Modelling Medical Insurance Costs with Linear Regression and tidymodels

Background

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles (https://www.tidymodels.org/).

The idea behind tidymodels is to address the inconsistency problem prevalent in R (Kuhn and Wickham 2020). Because different machine learning engines were developed by different people, the terminology and steps tend to differ across engines. The caret package, which has been the bedrock of machine learning in R, did not adequately address these inconsistency issues. The developer of caret, Max Kuhn, has been at the forefront of developing tidymodels alongside other data scientists at RStudio. The tidymodels framework standardizes the fitting of machine learning models by providing a uniform syntax. You can read more about tidymodels on the official website and in the documentation of the individual packages.

To install tidymodels, use the following code.

install.packages("tidymodels")

The tidymodels framework consists of several key packages for sampling, feature engineering, fitting, hyper-parameter tuning, and evaluation of machine learning models.

  1. Sampling: The rsample package allows for the creation of training and testing data sets.
  2. Feature engineering: The recipes package allows for the creation of new variables by transforming the existing ones.
  3. Model fitting: The parsnip package provides a consistent framework for fitting models.
  4. Model tuning: The dials and tune packages provide functions for tuning parameters.
  5. Model evaluation: The yardstick package allows for model evaluation.
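
A defining feature is that the syntax stays the same regardless of the underlying computational engine. A minimal sketch of this idea, assuming tidymodels is loaded as above (the glmnet engine is shown purely for illustration and would require the glmnet package):

## The same parsnip specification, two different engines
linear_reg() %>% set_engine("lm")

## Only the engine (and its penalty argument) changes, not the syntax
linear_reg(penalty = 0.1) %>% set_engine("glmnet")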

Objective and Caveats

In this exercise, I use a simple example of medical insurance to illustrate beginner concepts in tidymodels. This write-up is NOT for intermediate or advanced users of tidymodels. The exercise aims to illustrate the basic steps of using tidymodels, NOT optimal model fitting and performance. I assume basic knowledge of the R programming language and statistical computing environment (R Core Team 2021).

The Data

I use a sample of health insurance cost data from Kaggle and linear regression to demonstrate tidymodels. The data is available on Kaggle; I have also loaded a copy into my GitHub account. The direct link to the data is https://raw.githubusercontent.com/Karuitha/tidymodels_exe/master/insurance.csv. The data contain variables useful for predicting the cost of health insurance: age, sex, body mass index (BMI), children, smoker, and region. I seek to predict health insurance charges using these independent variables. First, I read the data into R.

## Load the tidyverse for reading and wrangling the data
## (read_csv, mutate, across, and as_factor come from its packages)
library(tidyverse)

## Reading in the data
insurance_data <- read_csv("https://raw.githubusercontent.com/Karuitha/tidymodels_exe/master/insurance.csv") %>% 
    
    mutate(across(where(is_character), as_factor))


## The first six rows of the data
head(insurance_data) %>% 
    
    ## Convert variable names to upper case
    set_names(names(.) %>% 
                
                str_to_upper()) %>% 
    
    ## Make a nice table
    knitr::kable(caption = "First 6 Rows of the Insurance Data", booktabs = TRUE) %>% 
    
    kableExtra::kable_styling(full_width = TRUE, bootstrap_options = "striped")
First 6 Rows of the Insurance Data
AGE SEX BMI CHILDREN SMOKER REGION CHARGES
19 female 27.900 0 yes southwest 16884.924
18 male 33.770 1 no southeast 1725.552
28 male 33.000 3 no southeast 4449.462
33 male 22.705 0 no northwest 21984.471
32 male 28.880 0 no northwest 3866.855
31 female 25.740 0 no southeast 3756.622

The data has 1338 rows and 7 columns. I start by splitting the data into a training set and a testing set and then proceed to model the data and evaluate the model performance on the test data.

The Training Set and Testing Set

I use the rsample package to split the data into training and testing sets. The rsample initial_split function takes the following inputs: the data, the proportion of the data that will go into the training set (prop), and an optional stratification variable (strata). I split the data into two parts, the training set and the test set, with proportions of 75% and 25%, respectively.

## Create a split object with 75% of the data in training set
insurance_split <- initial_split(insurance_data, 
                                 
                prop = 0.75, strata = charges)

## generate the training set
insurance_training <- insurance_split %>% 
    
    training()

## generate the testing set
insurance_testing <- insurance_split %>% 
    
    testing()

The training set has 1002 observations while the testing set has 336 observations.
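
These counts can be confirmed directly; a quick check (the figures above assume the split created earlier):

## Confirm the size of each set
nrow(insurance_training)
nrow(insurance_testing)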

Exploratory Data Analysis

I summarise and visualize the training data in this section.

## The data
insurance_training %>% 

## Use ggally to run pairwise graphs
GGally::ggpairs(mapping = aes(col = sex)) + 
## Choose colors
scale_color_manual(values = c("indianred4", "grey")) + 
## Choose colors
scale_fill_manual(values = c("indianred4", "grey")) + 
## Add title
labs(title = "Pairwise Visualisation of the Health Insurance Data")

The summary of the numeric variables is as follows.

## The data
insurance_training %>% 
## Select numeric data
select(where(is.numeric)) %>% 
## Get summary statistics
skimr::skim_without_charts() %>% 
## Select important variables
select(-skim_type, -complete_rate) %>% 
## Do a nice table
knitr::kable(caption = "Summary of Numeric Variables in the Training Set")
Summary of Numeric Variables in the Training Set
skim_variable n_missing numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
age 0 39.28942 14.125316 18.000 27.000 39.000 51.00000 64.00
bmi 0 30.52600 6.134900 15.960 26.185 30.115 34.47125 53.13
children 0 1.07485 1.193299 0.000 0.000 1.000 2.00000 5.00
charges 0 13172.61074 11918.779405 1121.874 4740.287 9382.033 16639.91251 63770.43

The tables below show the summaries of the factor variables.

## The data
table(insurance_training$sex) %>% 
## Do a nice table
knitr::kable(col.names = c("Sex", "Freq"), caption = "Summary of the Sex Variable")
Summary of the Sex Variable
Sex Freq
female 494
male 508
## The data
table(insurance_training$smoker) %>% 
## Do a nice table
knitr::kable(col.names = c("Smoker", "Freq"), caption = "Summary of the Smoker Variable")
Summary of the Smoker Variable
Smoker Freq
no 796
yes 206
## The data
table(insurance_training$region) %>% 
## Do a nice table
knitr::kable(col.names = c("Region", "Freq"), caption = "Summary of the Region Variable")
Summary of the Region Variable
Region Freq
northeast 246
northwest 250
southeast 270
southwest 236
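
The same frequency tables could also be produced in a single step across all factor columns; a minimal sketch, assuming the tidyverse (dplyr and purrr) is loaded:

## Tabulate every factor column at once
insurance_training %>% 
    select(where(is.factor)) %>% 
    purrr::map(table)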

Running the Linear Regression Model

The parsnip package provides a standardized interface for running numerous machine learning models with a unified syntax, unlike caret, where one has to memorize different terminologies used in separate packages. I start by defining the linear regression model (Aiken et al. 2012).

## Specify the model to run
linear_regression_model <- linear_reg() %>% 
## Set the engine to be used in running the regression
set_engine("lm") %>% 
## Specify the kind of model: regression or classification
set_mode("regression")

Next, I run the linear regression using the fit function from parsnip, specifying the variables as we do when running a regression using the lm function.

## Fit the model on training data
linear_regression_insurance <- linear_regression_model %>% 
    
    fit(charges ~ age + sex + bmi + children + smoker + region, 
        
        data = insurance_training)

## Create a tibble of the regression outputs
broom::tidy(linear_regression_insurance) %>% 
    
    ## make a nice table
    knitr::kable(caption = "Output of the Linear Regression Model", booktabs = TRUE) %>% 
    
    kableExtra::kable_styling(full_width = TRUE, bootstrap_options = "striped")
Output of the Linear Regression Model
term estimate std.error statistic p.value
(Intercept) -11926.0737 1116.39685 -10.6826473 0.0000000
age 249.9296 13.58589 18.3962684 0.0000000
sexmale -474.3536 381.56714 -1.2431719 0.2140979
bmi 354.5126 32.65751 10.8554683 0.0000000
children 430.5708 159.69883 2.6961424 0.0071334
smokeryes 23385.8384 472.78416 49.4640906 0.0000000
regionnorthwest -171.1264 541.37952 -0.3160931 0.7519982
regionsoutheast -1091.4418 548.19093 -1.9909884 0.0467558
regionsouthwest -1002.5757 550.96107 -1.8196852 0.0691080

The model shows that age, BMI, the number of children, and smoking are significant drivers of health insurance charges, and all are positively related to charges, while the coefficient on sex is not significant. For region, compared to the North East, individuals from all the other areas have, on average, lower health insurance charges; however, only the South East coefficient is significant at the 5% level.
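
Beyond the coefficient table, model-level statistics (such as the training-set \(R^{2}\) and the F-statistic) can be retrieved from the underlying lm object. A hedged sketch, assuming a parsnip version that provides extract_fit_engine():

## Pull the underlying lm object out of the parsnip fit and summarise it
linear_regression_insurance %>% 
    extract_fit_engine() %>% 
    broom::glance()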

Model Predictions

I use the model to predict the charges for patients in the test set. The predict function serves this purpose. It returns a tibble with a single column, .pred, whose rows match the rows of the test data in number and order.

## Use the model to make predictions on the test data
insurance_predictions <- linear_regression_insurance %>% 
predict(new_data = insurance_testing)

## View the first six rows of the predictions
head(insurance_predictions) %>% 
## Make a nice table
knitr::kable(caption = "First 6 Rows of the Predictions")
First 6 Rows of the Predictions
.pred
3409.325
10764.718
8283.220
12059.181
31577.684
15833.928

I attach these predictions to the testing data.

## Attach the predictions to the testing data
insurance_testing_with_predictions <- insurance_testing %>% 
bind_cols(insurance_predictions)

## View the first six rows of data with predictions
head(insurance_testing_with_predictions) %>% 
## Change the case of variable names
set_names(names(.) %>%
    str_to_sentence()) %>% 
## make a nice table
knitr::kable(caption = "First 6 Rows of the Testing Set with Predictions")
First 6 Rows of the Testing Set with Predictions
Age Sex Bmi Children Smoker Region Charges .Pred
18 male 33.770 1 no southeast 1725.552 3409.325
46 female 33.440 1 no southeast 8240.590 10764.718
37 male 29.830 2 no northeast 6406.411 8283.220
60 female 25.840 0 no northwest 28923.137 12059.181
27 male 42.130 0 yes southeast 39611.758 31577.684
60 female 36.005 0 no northeast 13228.847 15833.928

Model Evaluation

I first plot the predicted charges against the actual charges to visualize how well the model fits the data.

## Plot predicted versus actual charges
insurance_testing_with_predictions %>% 
    
    ## Call ggplot for plotting and give X and Y axis
    ggplot(mapping = aes(x = charges, y = .pred, col = sex)) + 
    
    ## Geom_point to do scatter plot
    geom_point(shape = 1, size = 2, alpha = 0.5, stroke = 3) + 
    
    ## geom_abline to do a line of best fit
    geom_abline(col = "purple") + 
    
    ## Harmonize X and Y axis scales
    coord_obs_pred() + 
    
    ## Put a nice theme 
    ggthemes::theme_clean() + 
  
    ## Add labels and titles
    labs(x = "Actual Charges", y = "Predicted Charges", 
         
         title = "Actual versus Predicted Charges") + 
    
    scale_color_manual(values = c("red", "blue")) + 
    
    ## Remove legend title
    theme(legend.title = element_blank())

The yardstick package provides a set of functions for evaluating models. For linear regression models, the root mean squared error (RMSE) and the coefficient of determination (\(R^{2}\)) are the prominent evaluation metrics. I compute these two metrics in turn.
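
For reference, the RMSE is the square root of the mean squared difference between the actual and predicted charges:

\[
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i = 1}^{n}\left(y_i - \hat{y}_i\right)^2},
\]

where \(y_i\) is the observed charge and \(\hat{y}_i\) the predicted charge for observation \(i\) in the test set of size \(n\).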

## Compute the root mean squared error (RMSE)
insurance_testing_with_predictions %>% 
rmse(truth = charges, estimate = .pred) %>% 
## Make a nice table
knitr::kable(caption = "The RMSE on the Test Set")
The RMSE on the Test Set
.metric .estimator .estimate
rmse standard 6224.63

Likewise, the coefficient of determination is as follows.

## Get the coefficient of determination
insurance_testing_with_predictions %>% 
rsq(truth = charges, estimate = .pred) %>% 
## Make a nice table
knitr::kable(caption = "The RMSE on the Test Set")
The RMSE on the Test Set
.metric .estimator .estimate
rsq standard 0.7628016
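
Both metrics can also be computed in a single call by bundling them with yardstick's metric_set(); a minimal sketch:

## Bundle RMSE and R-squared into one metric function
insurance_metrics <- metric_set(rmse, rsq)

## Apply the bundled metrics to the test-set predictions
insurance_testing_with_predictions %>% 
    insurance_metrics(truth = charges, estimate = .pred)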

Further Steps

The tidymodels ecosystem provides an even easier route to modelling via the last_fit function. The last_fit function takes in the model specification (in our case, linear_regression_model), the model formula (that is, the dependent and independent variables), and the split object. The function then runs the entire workflow as follows.

  1. It extracts the training and testing sets from the split object.
  2. It fits the model on the training set.
  3. It makes predictions and evaluates the model on the testing set.

After this, one can run the collect_predictions and collect_metrics functions to get the model predictions on the test set and the evaluation metrics (RMSE and \(R^{2}\)), respectively.

## Specify the model
insurance_model_final_fit <- linear_regression_model %>% 
    
    ## Specify the variables and the split object
    last_fit(charges ~ age + sex + bmi + children + smoker + region, 
        
        split = insurance_split)

Here I get the predictions. Note that the tibble contains both the predicted and the actual values of the dependent variable, charges. I present only the first ten observations.

## Get the predictions on the test set
insurance_model_final_fit %>% 
collect_predictions() %>% 
## Get top 10 observations of the predictions
head(10) %>% 
## Make a nice table
knitr::kable(caption = "Predictions on the Test Set")
Predictions on the Test Set
id .pred .row charges .config
train/test split 3409.325 2 1725.552 Preprocessor1_Model1
train/test split 10764.718 7 8240.590 Preprocessor1_Model1
train/test split 8283.220 9 6406.411 Preprocessor1_Model1
train/test split 12059.181 10 28923.137 Preprocessor1_Model1
train/test split 31577.684 15 39611.758 Preprocessor1_Model1
train/test split 15833.928 21 13228.847 Preprocessor1_Model1
train/test split 12847.132 26 14001.134 Preprocessor1_Model1
train/test split 4111.927 33 4687.797 Preprocessor1_Model1
train/test split 7227.232 48 3556.922 Preprocessor1_Model1
train/test split 5202.170 51 2211.131 Preprocessor1_Model1

Finally, I collect the metrics for evaluating the model performance using the collect_metrics function.

## Get the evaluation metrics on the test set
insurance_model_final_fit %>% 
collect_metrics() %>% 
## Make a nice table
knitr::kable(caption = "Model Evaluation Metrics on the Test Set")
Model Evaluation Metrics on the Test Set
.metric .estimator .estimate .config
rmse standard 6224.6295258 Preprocessor1_Model1
rsq standard 0.7628016 Preprocessor1_Model1

Conclusion

In this exercise, I have used medical insurance data to illustrate basic concepts of tidymodels. It is notable how tidymodels introduces consistent syntax, which simplifies the modelling process. The model in this example could be improved, for instance, by taking the log of the skewed dependent variable, charges. However, the aim of this exercise was not to optimize the model but to illustrate the use of tidymodels.
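
As a pointer for such further steps, the recipes package listed at the outset (but not used here) could handle that transformation; a minimal sketch, assuming a base-10 log is appropriate:

## A recipe that log-transforms the skewed outcome before modelling
insurance_recipe <- recipe(charges ~ ., data = insurance_training) %>% 
    step_log(charges, base = 10)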

References

Aiken, Leona S., Stephen G. West, Steven C. Pitts, Amanda N. Baraldi, and Ingrid C. Wurpts. 2012. "Multiple Linear Regression." In Handbook of Psychology, Second Edition, Vol. 2. London: Wiley Online Library.
Kuhn, M., and H. Wickham. 2020. "Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles." https://www.tidymodels.org. Accessed 24 May 2021.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.