GETTING STARTED WITH TIDYMODELS
Modelling Medical Insurance Costs with Linear Regression and tidymodels
Background
The `tidymodels` framework is a collection of packages for modeling and machine learning using tidyverse principles (https://www.tidymodels.org/). The idea behind `tidymodels` is to address the inconsistency problem prevalent in R (Kuhn and Wickham 2020). Given that different machine learning engines were developed by different people, the terminologies and steps tend to differ across these engines. The `caret` package, which has been the bedrock of machine learning in R, did not adequately address these inconsistency issues. The developer of `caret`, Max Kuhn, has been at the forefront of developing `tidymodels` alongside other data scientists at RStudio. The `tidymodels` framework standardizes the fitting of machine learning models through a uniform syntax. You can read more on `tidymodels` here, here, and here.
To install `tidymodels`, use the following code.
install.packages("tidymodels")
The `tidymodels` framework consists of several key packages for sampling, feature engineering, model fitting, hyper-parameter tuning, and evaluation of machine learning models.
- Sampling: The `rsample` package allows for the creation of training and testing data sets.
- Feature engineering: The `recipes` package allows for the creation of new variables by manipulating the existing variables.
- Model fitting: The `parsnip` package provides a consistent framework for fitting models.
- Model tuning: The `dials` and `tune` packages provide functions for tuning hyper-parameters.
- Model evaluation: The `yardstick` package allows for model evaluation.
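The setup chunk for this post is not shown, so the following is only a minimal sketch of an assumed setup: loading `tidymodels` attaches the core packages listed above, while `tidyverse` supplies the data-manipulation and plotting functions used later.
## Setup (sketch of an assumed setup chunk, not shown in the post)
library(tidymodels) ## attaches rsample, recipes, parsnip, tune, dials, yardstick, among others
library(tidyverse)  ## attaches readr, dplyr, forcats, stringr, ggplot2, among others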
Objective and Caveats
In this exercise, I use a simple example of medical insurance to illustrate beginner concepts in `tidymodels`. This write-up is NOT for intermediate or advanced users of `tidymodels`. Again, the exercise aims to illustrate the basic steps in using `tidymodels`, NOT model fitting and performance. I assume basic knowledge of the R programming language and statistical analysis environment (R Core Team 2021).
NB: Click on Code to view the R code associated with any of the steps.
The Data
I use a sample of health insurance cost data from Kaggle and linear regression to demonstrate `tidymodels`. The data is available on this link. However, I have loaded the data into my GitHub account. The direct link to the data is https://raw.githubusercontent.com/Karuitha/tidymodels_exe/master/insurance.csv. The data contain variables to predict the cost of health insurance. These variables include age, sex, body mass index (BMI), children, smoker, and region. I seek to predict health insurance charges using the independent variables provided. First, I read the data into R.
## Reading in the data
insurance_data <- read_csv("https://raw.githubusercontent.com/Karuitha/tidymodels_exe/master/insurance.csv") %>%
  mutate(across(where(is_character), as_factor))
## The first six rows of the data
head(insurance_data) %>%
  ## Convert variable names to upper case
  set_names(names(.) %>% str_to_upper()) %>%
  ## Make a nice table
  knitr::kable(caption = "First 6 Rows of the Insurance Data", booktabs = TRUE) %>%
  kableExtra::kable_styling(full_width = TRUE, bootstrap_options = "striped")
AGE | SEX | BMI | CHILDREN | SMOKER | REGION | CHARGES |
---|---|---|---|---|---|---|
19 | female | 27.900 | 0 | yes | southwest | 16884.924 |
18 | male | 33.770 | 1 | no | southeast | 1725.552 |
28 | male | 33.000 | 3 | no | southeast | 4449.462 |
33 | male | 22.705 | 0 | no | northwest | 21984.471 |
32 | male | 28.880 | 0 | no | northwest | 3866.855 |
31 | female | 25.740 | 0 | no | southeast | 3756.622 |
The data has 1338 rows and 7 columns. I start by splitting the data into a training set and a testing set and then proceed to model the data and evaluate the model performance on the test data.
The Training Set and Testing Set
I use the `rsample` package to split the data into training and testing sets. The `rsample` function `initial_split()` takes the following inputs: the data, the proportion of the data that will go into the training set (`prop`), and the strata used to divide the data (`strata`). I split the data into two parts, the training set and the test set, with proportions of 75% and 25%, respectively.
## Create a split object with 75% of the data in the training set
insurance_split <- initial_split(insurance_data,
                                 prop = 0.75, strata = charges)

## Generate the training set
insurance_training <- insurance_split %>%
  training()

## Generate the testing set
insurance_testing <- insurance_split %>%
  testing()
The training set has 1002 observations while the testing set has 336 observations.
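As a quick check, the set sizes can be confirmed with `nrow()`; this is a minimal sketch that assumes the `insurance_training` and `insurance_testing` objects created above.
## Confirm the sizes of the training and testing sets (sketch)
nrow(insurance_training) ## 1002 observations
nrow(insurance_testing)  ## 336 observations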
Exploratory Data Analysis
I summarise and visualize the training data in this section.
## The data
insurance_training %>%
  ## Use GGally to run pairwise graphs
  GGally::ggpairs(mapping = aes(col = sex)) +
  ## Choose colors
  scale_color_manual(values = c("indianred4", "grey")) +
  ## Choose fill colors
  scale_fill_manual(values = c("indianred4", "grey")) +
  ## Add title
  labs(title = "Pairwise Visualisation of the Health Insurance Data")
The summary of the numeric variables is as follows.
## The data
insurance_training %>%
  ## Select numeric data
  select(where(is.numeric)) %>%
  ## Get summary statistics
  skimr::skim_without_charts() %>%
  ## Select important variables
  select(-skim_type, -complete_rate) %>%
  ## Do a nice table
  knitr::kable(caption = "Summary of Numeric Variables in the Training Set")
skim_variable | n_missing | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 |
---|---|---|---|---|---|---|---|---|
age | 0 | 39.28942 | 14.125316 | 18.000 | 27.000 | 39.000 | 51.00000 | 64.00 |
bmi | 0 | 30.52600 | 6.134900 | 15.960 | 26.185 | 30.115 | 34.47125 | 53.13 |
children | 0 | 1.07485 | 1.193299 | 0.000 | 0.000 | 1.000 | 2.00000 | 5.00 |
charges | 0 | 13172.61074 | 11918.779405 | 1121.874 | 4740.287 | 9382.033 | 16639.91251 | 63770.43 |
The table below shows the summary for the factor variables.
## The data
table(insurance_training$sex) %>%
  ## Do a nice table
  knitr::kable(col.names = c("Sex", "Freq"), caption = "Summary of the Sex Variable")
Sex | Freq |
---|---|
female | 494 |
male | 508 |
## The data
table(insurance_training$smoker) %>%
  ## Do a nice table
  knitr::kable(col.names = c("Smoker", "Freq"), caption = "Summary of the Smoker Variable")
Smoker | Freq |
---|---|
no | 796 |
yes | 206 |
## The data
table(insurance_training$region) %>%
  ## Do a nice table
  knitr::kable(col.names = c("Region", "Freq"), caption = "Summary of the Region Variable")
Region | Freq |
---|---|
northeast | 246 |
northwest | 250 |
southeast | 270 |
southwest | 236 |
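As an aside, the three frequency tables above can be produced in a single pass. The snippet below is only a sketch of a hypothetical alternative; it assumes `purrr` is available (it is attached with the `tidyverse`).
## Sketch: tabulate every factor variable in the training set at once
insurance_training %>%
  select(where(is.factor)) %>%
  purrr::map(table)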
Running the Linear Regression Model
The `parsnip` package provides a standardized interface for running numerous machine learning models with a unified syntax, unlike `caret`, where one has to memorize different terminologies used in separate packages. I start by defining the linear regression model (Aiken et al. 2012).
## Specify the model to run
linear_regression_model <- linear_reg() %>%
  ## Set the engine to be used in running the regression
  set_engine("lm") %>%
  ## Specify the kind of model: regression or classification
  set_mode("regression")
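To see what the unified syntax buys us, note that the same specification pattern would define, say, a penalised linear model simply by changing the engine. The snippet below is a sketch only; the `glmnet` engine and the penalty value are illustrative assumptions and are not used elsewhere in this exercise.
## Sketch only: the same specification pattern with a different engine
penalised_linear_model <- linear_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("regression")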
Next, I run the linear regression using the `fit()` function from `parsnip`, specifying the variables as we do when running a regression using the `lm()` function.
## Fit the model on the training data
linear_regression_insurance <- linear_regression_model %>%
  fit(charges ~ age + sex + bmi + children + smoker + region,
      data = insurance_training)

## Create a tibble of the regression outputs
broom::tidy(linear_regression_insurance) %>%
  ## Make a nice table
  knitr::kable(caption = "Output of the Linear Regression Model", booktabs = TRUE) %>%
  kableExtra::kable_styling(full_width = TRUE, bootstrap_options = "striped")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -11926.0737 | 1116.39685 | -10.6826473 | 0.0000000 |
age | 249.9296 | 13.58589 | 18.3962684 | 0.0000000 |
sexmale | -474.3536 | 381.56714 | -1.2431719 | 0.2140979 |
bmi | 354.5126 | 32.65751 | 10.8554683 | 0.0000000 |
children | 430.5708 | 159.69883 | 2.6961424 | 0.0071334 |
smokeryes | 23385.8384 | 472.78416 | 49.4640906 | 0.0000000 |
regionnorthwest | -171.1264 | 541.37952 | -0.3160931 | 0.7519982 |
regionsoutheast | -1091.4418 | 548.19093 | -1.9909884 | 0.0467558 |
regionsouthwest | -1002.5757 | 550.96107 | -1.8196852 | 0.0691080 |
The model shows that age, BMI, number of children, and smoking are significant drivers of health insurance charges, and all are positively related to charges; the coefficient for sex is not significant. For the region, compared to the North East (the reference category), individuals from all the other areas have, on average, lower health insurance charges. However, only the coefficient for the South East region is significant at the 5% level.
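For model-level diagnostics (overall \(R^{2}\), the F-statistic, and so on), one option is to pull out the underlying `lm` object. This is a minimal sketch that assumes a recent version of `parsnip` providing `extract_fit_engine()`.
## Sketch: inspect the underlying lm fit for model-level statistics
linear_regression_insurance %>%
  extract_fit_engine() %>%
  summary()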
Model Predictions
I use the model to predict the charges for patients in the test set. The `predict()` function serves this purpose. It generates a tibble with a single new variable, `.pred`, whose rows match the number and order of rows in the test data.
## Use the model to make predictions on the test data
insurance_predictions <- linear_regression_insurance %>%
  predict(new_data = insurance_testing)

## View the first six rows of the predictions
head(insurance_predictions) %>%
  ## Make a nice table
  knitr::kable(caption = "First 6 Rows of the Predictions")
.pred |
---|
3409.325 |
10764.718 |
8283.220 |
12059.181 |
31577.684 |
15833.928 |
I attach these predictions to the testing data.
## Attach the predictions to the testing data
insurance_testing_with_predictions <- insurance_testing %>%
  bind_cols(insurance_predictions)

## View the first six rows of the data with predictions
head(insurance_testing_with_predictions) %>%
  ## Change the case of variable names
  set_names(names(.) %>% str_to_sentence()) %>%
  ## Make a nice table
  knitr::kable(caption = "First 6 Rows of the Testing Set with Predictions")
Age | Sex | Bmi | Children | Smoker | Region | Charges | .Pred |
---|---|---|---|---|---|---|---|
18 | male | 33.770 | 1 | no | southeast | 1725.552 | 3409.325 |
46 | female | 33.440 | 1 | no | southeast | 8240.590 | 10764.718 |
37 | male | 29.830 | 2 | no | northeast | 6406.411 | 8283.220 |
60 | female | 25.840 | 0 | no | northwest | 28923.137 | 12059.181 |
27 | male | 42.130 | 0 | yes | southeast | 39611.758 | 31577.684 |
60 | female | 36.005 | 0 | no | northeast | 13228.847 | 15833.928 |
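As an aside, the predict-then-bind step can typically be collapsed into one call. The line below is a sketch that assumes a version of `parsnip` providing an `augment()` method for fitted models.
## Sketch: append predictions to the test set in a single step
augment(linear_regression_insurance, new_data = insurance_testing)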
Model Evaluation
I first plot the predicted charges against the actual charges to visualize how well the model fits the data.
## Plot predicted versus actual charges
insurance_testing_with_predictions %>%
  ## Call ggplot for plotting and give the X and Y axes
  ggplot(mapping = aes(x = charges, y = .pred, col = sex)) +
  ## geom_point to do a scatter plot
  geom_point(shape = 1, size = 2, alpha = 0.5, stroke = 3) +
  ## geom_abline to add a 45-degree reference line (perfect predictions)
  geom_abline(col = "purple") +
  ## Harmonize the X and Y axis scales
  coord_obs_pred() +
  ## Put a nice theme
  ggthemes::theme_clean() +
  ## Add labels and a title
  labs(x = "Actual Charges", y = "Predicted Charges",
       title = "Actual versus Predicted Charges") +
  scale_color_manual(values = c("red", "blue")) +
  ## Remove the legend title
  theme(legend.title = element_blank())
The `yardstick` package provides a set of functions for evaluating models. For linear regression models, the root mean squared error (RMSE) and the coefficient of determination (\(R^{2}\)) are the prominent evaluation metrics. I compute these two metrics in turn.
## Compute the root mean squared error (RMSE)
insurance_testing_with_predictions %>%
  rmse(truth = charges, estimate = .pred) %>%
  ## Make a nice table
  knitr::kable(caption = "The RMSE on the Test Set")
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 6224.63 |
Likewise, the coefficient of determination is as follows.
## Get the coefficient of determination
insurance_testing_with_predictions %>%
  rsq(truth = charges, estimate = .pred) %>%
  ## Make a nice table
  knitr::kable(caption = "The R-Squared on the Test Set")
.metric | .estimator | .estimate |
---|---|---|
rsq | standard | 0.7628016 |
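Both metrics can also be computed in a single call using `yardstick`'s `metric_set()`. The following is a minimal sketch that reuses the objects created above.
## Sketch: compute RMSE and R-squared together
insurance_metrics <- metric_set(rmse, rsq)

insurance_testing_with_predictions %>%
  insurance_metrics(truth = charges, estimate = .pred)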
Further Steps
The `tidymodels` ecosystem provides an even easier route to modelling using the `last_fit()` function. The `last_fit()` function takes in the model specification (in our case, `linear_regression_model`), the model formula (that is, the dependent and independent variables), and the split object. The function then runs the entire workflow as follows.
- It extracts the training and testing sets from the split object.
- It fits the model on the training set.
- It makes predictions and evaluates the model on the testing set.
After this, one can run the `collect_predictions()` and `collect_metrics()` functions to get the model predictions on the test set and the evaluation metrics (RMSE and \(R^{2}\)), respectively.
## Specify the model
insurance_model_final_fit <- linear_regression_model %>%
  ## Specify the variables and the split object
  last_fit(charges ~ age + sex + bmi + children + smoker + region,
           split = insurance_split)
Here I get the predictions. Note that the resulting tibble contains both the predicted and the actual values of the dependent variable, charges. I present only the first ten observations.
## Get the predictions on the test set
insurance_model_final_fit %>%
  collect_predictions() %>%
  ## Get the top 10 observations of the predictions
  head(10) %>%
  ## Make a nice table
  knitr::kable(caption = "Predictions on the Test Set")
id | .pred | .row | charges | .config |
---|---|---|---|---|
train/test split | 3409.325 | 2 | 1725.552 | Preprocessor1_Model1 |
train/test split | 10764.718 | 7 | 8240.590 | Preprocessor1_Model1 |
train/test split | 8283.220 | 9 | 6406.411 | Preprocessor1_Model1 |
train/test split | 12059.181 | 10 | 28923.137 | Preprocessor1_Model1 |
train/test split | 31577.684 | 15 | 39611.758 | Preprocessor1_Model1 |
train/test split | 15833.928 | 21 | 13228.847 | Preprocessor1_Model1 |
train/test split | 12847.132 | 26 | 14001.134 | Preprocessor1_Model1 |
train/test split | 4111.927 | 33 | 4687.797 | Preprocessor1_Model1 |
train/test split | 7227.232 | 48 | 3556.922 | Preprocessor1_Model1 |
train/test split | 5202.170 | 51 | 2211.131 | Preprocessor1_Model1 |
Finally, I collect the metrics for evaluating the model performance using the `collect_metrics()` function.
## Get the evaluation metrics on the test set
insurance_model_final_fit %>%
  collect_metrics() %>%
  ## Make a nice table
  knitr::kable(caption = "Model Evaluation Metrics on the Test Set")
.metric | .estimator | .estimate | .config |
---|---|---|---|
rmse | standard | 6224.6295258 | Preprocessor1_Model1 |
rsq | standard | 0.7628016 | Preprocessor1_Model1 |
Conclusion
In this exercise, I have used medical insurance data to illustrate basic concepts of `tidymodels`. It is notable how `tidymodels` introduces consistent syntax, which simplifies the modelling process. The model in this example could be improved, for instance, by taking the log of the skewed dependent variable `charges`; a sketch of one way to do this follows. However, this exercise aimed not to optimize the model but to illustrate the use of `tidymodels`.
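One hypothetical way to build that transformation into a `tidymodels` workflow is with a `recipes` step. The sketch below is illustrative only and was not used in the model above.
## Sketch: a recipe that log-transforms the outcome before modelling
insurance_recipe <- recipe(charges ~ age + sex + bmi + children + smoker + region,
                           data = insurance_training) %>%
  step_log(charges)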