Blog Post 4 - An Introduction To Tidymodels

Overview

In this blog post I’m going to provide an introduction to tidymodels. Tidymodels is the successor to the caret package. If you are like me, you may have used caret recently in completing some of your Data 621 homework assignments.

Tidymodels is a collection of modeling packages that, like the tidyverse, have a consistent API and are designed to work together, specifically to support predictive analytics and machine learning. Core tidymodels packages include parsnip, recipes, rsample and tune. Collectively, these packages provide a grammar for modeling and a unified interface to many of the model varieties available in R.

In this post we will focus on four different libraries from the tidymodels suite: rsample for data sampling and cross-validation, recipes for data preprocessing, parsnip for model building and yardstick for model evaluation.

Let’s Get Started

Packages

We’ll load five packages: tidymodels, kableExtra, tidyverse, skimr and tibble. Tidymodels will be the star of this post.

library(tidymodels)
library(kableExtra)
library(tidyverse)
library(skimr)
library(tibble)

Data

We will use customer churn data for our analysis. After loading the data, the skimr package provides us with a great data overview.

Skimr is a great package and we see that we have a pretty clean data set, with only a handful of missing values (TotalCharges).

telco <- readr::read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

telco %>% 
  skimr::skim()

Data summary

Name	Piped data
Number of rows	7043
Number of columns	21
_______________________
Column type frequency:
character	17
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
customerID	1	10	10	7043
gender	1	4	6	2
Partner	1	2	3	2
Dependents	1	2	3	2
PhoneService	1	2	3	2
MultipleLines	1	2	16	3
InternetService	1	2	11	3
OnlineSecurity	1	2	19	3
OnlineBackup	1	2	19	3
DeviceProtection	1	2	19	3
TechSupport	1	2	19	3
StreamingTV	1	2	19	3
StreamingMovies	1	2	19	3
Contract	1	8	14	3
PaperlessBilling	1	2	3	2
PaymentMethod	1	12	25	4
Churn	1	2	3	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
SeniorCitizen	0	1	0.16	0.37	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
tenure	0	1	32.37	24.56	0.00	9.00	29.00	55.00	72.00	▇▃▃▃▆
MonthlyCharges	0	1	64.76	30.09	18.25	35.50	70.35	89.85	118.75	▇▅▆▇▅
TotalCharges	11	1	2283.30	2266.77	18.80	401.45	1397.47	3794.74	8684.80	▇▂▂▂▁

Modeling with `tidymodels`

We’ll skip EDA in this post and get right down to business. We’re going to use `tidymodels` to fit and evaluate a logistic regression model.

Train Test Split

rsample is a core package in tidymodels. It provides a streamlined way to create a randomised training and test split of the original data. If we want to achieve an 80:20 split, the inputs to the initial_split function are straightforward: data = telco, prop = 0.80. We also set a seed so we can reproduce the results. With that we get an 80:20 split - of the 7,043 total customers, 5,635 have been assigned to the training set and 1,408 to the test set. We have now created our training set, train_tbl, and our test set, test_tbl. That was easy.

set.seed(seed = 4763) 
train_test_split <-
  rsample::initial_split(
    data = telco,     
    prop = 0.80   
  ) 
train_test_split

## <Training/Validation/Total>
## <5635/1408/7043>

train_tbl <- train_test_split %>% training() 
test_tbl  <- train_test_split %>% testing()

Recipes

Via the recipes package, tidymodels uses a cooking metaphor for data preprocessing - imputing missing values, centering and scaling, and one-hot encoding.

The first step is to create our recipe. This is where we will define the transformations we want to apply to our data. For our blog post we’ll simply change all of the character variables to factors, but there is a lot more we could do.

Next we prep the recipe by mixing transforms with the data. This can all be included in a function.

recipe_simple <- function(dataset) {
  recipe(Churn ~ ., data = dataset) %>%
    step_string2factor(all_nominal(), -all_outcomes()) %>%
    prep(data = dataset)
}

recipe_prepped <- recipe_simple(dataset= train_tbl)

The final step in the process, to continue with the cooking metaphor, is to bake the recipe. This is how our preprocessing steps are applied to the data.

train_baked <- bake(recipe_prepped, new_data = train_tbl)
test_baked  <- bake(recipe_prepped, new_data = test_tbl)
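To give a flavour of the other preprocessing steps a recipe can hold, here is a hedged sketch on a tiny made-up data frame. The toy data, its column names and the name recipe_extended are assumptions for illustration only and are not part of the telco analysis.

```r
library(recipes)

# Hypothetical toy data with one missing value and one factor predictor.
toy <- data.frame(
  churn   = factor(c("No", "Yes", "No", "Yes", "No", "No")),
  charges = c(20, 85, NA, 95, 30, 45),
  plan    = factor(c("basic", "premium", "basic", "premium", "basic", "premium"))
)

recipe_extended <- recipe(churn ~ ., data = toy) %>%
  step_impute_median(all_numeric_predictors()) %>%  # fill missing numbers with the median
  step_normalize(all_numeric_predictors()) %>%      # centre and scale
  step_dummy(all_nominal_predictors()) %>%          # turn factors into indicator columns
  prep()

# new_data = NULL returns the preprocessed training data.
toy_baked <- bake(recipe_extended, new_data = NULL)
toy_baked
```

Note that step_dummy creates one fewer indicator column than there are levels unless one_hot = TRUE is set.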

Build The Model

tidymodels leans on the parsnip package for its model building. parsnip offers a unified API that allows access to a variety of analytic packages without the requirement of learning the syntax for each package. It only takes three simple steps to fit models:

Pick the type of model - we are going to use logistic regression
Specify the engine - we’ll use glm
Define the model specification / formula and data - we’ll use MonthlyCharges, tenure and gender

# 1. model type, 2. engine, 3. formula and training data
logistic_glm <-
  logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit(Churn ~ MonthlyCharges + tenure + gender, data = train_baked)
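To look inside a fitted parsnip model, something along these lines might help. It is a sketch only: it uses the built-in mtcars data as a stand-in for the telco data, and the names cars and fit_glm are hypothetical.

```r
library(tidymodels)

# Stand-in data: mtcars, with transmission type recoded as a factor outcome.
cars <- mtcars %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")))

# Same three steps as in the post: model type, engine, formula and data.
fit_glm <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(am ~ wt, data = cars)

# One row per model term, with its estimate, standard error and p-value.
coefs <- tidy(fit_glm)
coefs
```

Running tidy(logistic_glm) on the model from this post should return the same kind of table for MonthlyCharges, tenure and gender.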

If we wanted to switch to a different engine, all we would have to do is change the set_engine argument to the desired tool and parsnip handles all the dirty work behind the scenes - I said unified API!

set_engine("glmnet")
set_engine("stan")
set_engine("spark")
set_engine("keras")

There’s a long list of engines we can use.
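One way to check which engines the installed version of parsnip knows about for a given model type is sketched below. The names engines and spec_glm are made up for illustration, and the list returned will depend on the parsnip version and any extension packages that are loaded.

```r
library(parsnip)

# List the engines registered for logistic regression.
engines <- show_engines("logistic_reg")
engines

# A specification only records the chosen engine; nothing is fitted yet.
spec_glm <- set_engine(logistic_reg(), "glm")
spec_glm
```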

Model Evaluation

After building our model, it’s time to evaluate the results. This is where the yardstick package comes in. yardstick provides a simple way to calculate several popular assessment measures. But before we can do that we’ll need some predictions. We get our predictions by passing the test_baked data to the predict function.

predictions_glm <- logistic_glm %>%
  predict(new_data = test_baked) %>%
  bind_cols(test_baked %>% select(Churn))
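predict returns hard class labels by default. If class probabilities are wanted instead, a sketch like the following may be useful. It again uses mtcars as a stand-in for the telco data, predicts on the training rows purely to show the shape of the output, and the names cars, fit_glm, probs and auc are hypothetical.

```r
library(tidymodels)

# Stand-in data: mtcars, with transmission type recoded as a factor outcome.
cars <- mtcars %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")))

fit_glm <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(am ~ wt, data = cars)

# type = "prob" returns one probability column per outcome level.
probs <- predict(fit_glm, new_data = cars, type = "prob") %>%
  bind_cols(cars %>% select(am))

# Area under the ROC curve, treating the second level ("manual") as the event.
auc <- roc_auc(probs, truth = am, .pred_manual, event_level = "second")
auc
```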

Now with our freshly minted predictions there are several metrics that can be used to evaluate the performance of our classification model. For the sake of simplicity we will focus on some of the metrics I introduced in Blog Post 3 - Accuracy, Precision, Recall and F1 Score. You may remember that all these measures are derived from the Confusion Matrix, a table that describes the performance of a classification model.

Refer back to Blog Post 3 if you need a refresher on the Confusion Matrix.

Tah Dah

Now with just a few more lines of code we have a Confusion Matrix that can be leveraged to calculate additional metrics.

predictions_glm %>%
  conf_mat(Churn, .pred_class) %>%
  pluck(1) %>%
  as_tibble() %>%
  ggplot(aes(Prediction, Truth, alpha = n)) +
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = n), colour = "white", alpha = 1, size = 8)
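The confusion matrix object can also report a batch of derived metrics in one go. The sketch below uses ten made-up truth/prediction pairs (toy_preds is a hypothetical stand-in for predictions_glm) so that the counts are easy to check by hand.

```r
library(yardstick)

# Hypothetical truth/prediction pairs standing in for predictions_glm.
toy_preds <- data.frame(
  Churn = factor(
    c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"),
    levels = c("No", "Yes")
  ),
  .pred_class = factor(
    c("Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes"),
    levels = c("No", "Yes")
  )
)

cm <- conf_mat(toy_preds, truth = Churn, estimate = .pred_class)
cm                        # counts: predictions in rows, truth in columns
cm_metrics <- summary(cm) # accuracy, sensitivity, specificity, precision, ...
cm_metrics
```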

Accuracy

The model’s Accuracy is the fraction of predictions the model got right. It can be easily calculated by passing predictions_glm to the metrics function. However, accuracy is not a very reliable metric, as it gives misleading results when the data set is unbalanced.

predictions_glm %>%
  metrics(Churn, .pred_class) %>%
  select(-.estimator) %>%
  filter(.metric == "accuracy")

## # A tibble: 1 x 2
##   .metric  .estimate
##   <chr>        <dbl>
## 1 accuracy     0.786
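A small made-up example shows why accuracy can mislead on unbalanced data. Here 95 of 100 hypothetical customers stay, and a "model" that always predicts No still scores well on accuracy while finding none of the churners. The vectors truth and always_no are invented for illustration.

```r
library(yardstick)

# Hypothetical data: 95 customers stay, 5 leave.
truth <- factor(c(rep("No", 95), rep("Yes", 5)), levels = c("No", "Yes"))

# A "model" that predicts No for everyone.
always_no <- factor(rep("No", 100), levels = c("No", "Yes"))

acc <- accuracy_vec(truth, always_no)                       # 0.95
rec <- recall_vec(truth, always_no, event_level = "second") # 0: no churner is found
c(accuracy = acc, recall_yes = rec)
```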

Precision and Recall

Precision shows how sensitive models are to False Positives (i.e. predicting a customer is leaving when they are actually staying), whereas Recall looks at how sensitive models are to False Negatives (i.e. predicting a customer is staying when they are actually leaving).

These are very relevant business metrics because organisations are particularly interested in accurately predicting which customers are truly at risk of leaving so that they can target them with retention strategies. At the same time, they want to minimise the effort spent retaining customers who were incorrectly classified as leaving but are actually staying.

tibble(
  "precision" = 
     precision(predictions_glm, Churn, .pred_class) %>%
     pull(.estimate),
  "recall" = 
     recall(predictions_glm, Churn, .pred_class) %>%
     pull(.estimate)
) %>%
  kable()

precision	recall
0.8252853	0.9012464
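One thing worth checking is which class yardstick counts as the "positive" one. By default it treats the first factor level as the event, which for a No/Yes outcome would typically be No. The sketch below, on ten made-up truth/prediction pairs (truth and pred are hypothetical), shows how the event_level argument changes the numbers.

```r
library(yardstick)

# Hypothetical data: 4 customers leave ("Yes"), 6 stay ("No").
truth <- factor(
  c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"),
  levels = c("No", "Yes")
)
pred <- factor(
  c("Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes"),
  levels = c("No", "Yes")
)

# Default: the first level ("No") is the event.
precision_no <- precision_vec(truth, pred)  # 5 / 7
recall_no    <- recall_vec(truth, pred)     # 5 / 6

# Treat "Yes" (leaving) as the event instead.
precision_yes <- precision_vec(truth, pred, event_level = "second")  # 2 / 3
recall_yes    <- recall_vec(truth, pred, event_level = "second")     # 2 / 4
```

If leaving is the outcome of interest, passing event_level = "second" to precision and recall would be one way to score the Yes class directly.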

F1 Score

The F1 Score is the harmonic mean of precision and recall. An F1 score reaches its best value at 1, with perfect precision and recall.

predictions_glm %>%
  f_meas(Churn, .pred_class) %>%
  select(-.estimator) %>%
  kable()

.metric	.estimate
f_meas	0.8615949
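As a quick sanity check, the harmonic mean can be computed by hand from the precision and recall reported earlier in this post. The variable names p, r and f1 are just for illustration.

```r
# Precision and recall as reported in the table above.
p <- 0.8252853
r <- 0.9012464

# Harmonic mean of precision and recall.
f1 <- 2 * p * r / (p + r)
round(f1, 4)
```

The result agrees with the f_meas estimate to the precision shown.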

Wrap Up

That’s a quick introduction to tidymodels. If you are interested in a more detailed explanation, you might want to check out the talk at the recent rstudioConference.

Thanks for reading.