In this blog post I’m going to provide an introduction to tidymodels. Tidymodels is the successor to the caret package. If you are like me, you may have used caret recently in completing some of your Data 621 homework assignments.
Tidymodels is a collection of modeling packages that, like the tidyverse, share a consistent API and are designed to work together, specifically to support predictive analytics and machine learning. Core tidymodels packages include parsnip, recipes, rsample and tune. Collectively, these packages provide a grammar for modeling and a unified interface to many of the model varieties available in R.
In this post we will focus on four different libraries from the tidymodels suite: rsample for data sampling and cross-validation, recipes for data preprocessing, parsnip for model building and yardstick for model evaluation.
We’ll load five packages: tidymodels, kableExtra, tidyverse, skimr and tibble. Tidymodels will be the star of this post.
We will use customer churn data for our analysis. After we load the data, the skimr package gives us a great overview of it.
Skimr is a great package, and it shows that we have a pretty clean data set: the only missing values are 11 in TotalCharges.
| Name | Piped data |
| Number of rows | 7043 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 17 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| customerID | 0 | 1 | 10 | 10 | 0 | 7043 | 0 |
| gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| Partner | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| Dependents | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| PhoneService | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| MultipleLines | 0 | 1 | 2 | 16 | 0 | 3 | 0 |
| InternetService | 0 | 1 | 2 | 11 | 0 | 3 | 0 |
| OnlineSecurity | 0 | 1 | 2 | 19 | 0 | 3 | 0 |
| OnlineBackup | 0 | 1 | 2 | 19 | 0 | 3 | 0 |
| DeviceProtection | 0 | 1 | 2 | 19 | 0 | 3 | 0 |
| TechSupport | 0 | 1 | 2 | 19 | 0 | 3 | 0 |
| StreamingTV | 0 | 1 | 2 | 19 | 0 | 3 | 0 |
| StreamingMovies | 0 | 1 | 2 | 19 | 0 | 3 | 0 |
| Contract | 0 | 1 | 8 | 14 | 0 | 3 | 0 |
| PaperlessBilling | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| PaymentMethod | 0 | 1 | 12 | 25 | 0 | 4 | 0 |
| Churn | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| tenure | 0 | 1 | 32.37 | 24.56 | 0.00 | 9.00 | 29.00 | 55.00 | 72.00 | ▇▃▃▃▆ |
| MonthlyCharges | 0 | 1 | 64.76 | 30.09 | 18.25 | 35.50 | 70.35 | 89.85 | 118.75 | ▇▅▆▇▅ |
| TotalCharges | 11 | 1 | 2283.30 | 2266.77 | 18.80 | 401.45 | 1397.47 | 3794.74 | 8684.80 | ▇▂▂▂▁ |
We’ll skip EDA in this post and get right down to business. We’re going to use tidymodels to fit and evaluate a logistic regression model.
rsample is a core package in tidymodels. It provides a streamlined way to create a randomised training and test split of the original data. If we want an 80:20 split, the inputs to the initial_split function are straightforward: data = telco, prop = 0.80. We also set a seed so we can reproduce the results. With that we get an 80:20 split: of the 7,043 total customers, 5,635 have been assigned to the training set and 1,408 to the test set. Extracting the two pieces gives us our training set, train_tbl, and our test set, test_tbl. That was easy.
set.seed(seed = 4763)
train_test_split <-
  rsample::initial_split(
    data = telco,
    prop = 0.80
  )
train_test_split
## <Training/Validation/Total>
## <5635/1408/7043>
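The split object is not yet a pair of data frames. Here is a minimal sketch of how the training and test sets could be pulled out of it with rsample's training() and testing() accessors. The small `toy` data frame is a hypothetical stand-in for telco so that the snippet runs on its own; it is not the real data.

```r
library(rsample)

# Hypothetical stand-in for the telco data (not the real data set)
toy <- data.frame(
  customerID = sprintf("C%03d", 1:100),
  tenure     = rep(1:50, times = 2)
)

set.seed(4763)
toy_split <- initial_split(data = toy, prop = 0.80)

# training() and testing() return the two partitions as data frames
train_tbl <- training(toy_split)
test_tbl  <- testing(toy_split)

# Every row lands in exactly one of the two sets
nrow(train_tbl) + nrow(test_tbl) == nrow(toy)
length(intersect(train_tbl$customerID, test_tbl$customerID)) == 0
```

With the real data, the same two calls on train_test_split would produce train_tbl and test_tbl.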
tidymodels, via the recipes package, uses a cooking metaphor for data preprocessing steps such as imputing missing values, centering and scaling, and one-hot encoding.
The first step is to create our recipe. This is where we define the transformations we want to apply to our data. For this blog post we’ll simply change all of the character variables to factors, but there is a lot more we could do.
Next we prep the recipe by mixing the transforms with the data. This can all be wrapped in a function.
recipe_simple <- function(dataset) {
  recipe(Churn ~ ., data = dataset) %>%
    step_string2factor(all_nominal(), -all_outcomes()) %>%
    prep(training = dataset)
}
recipe_prepped <- recipe_simple(dataset = train_tbl)
The final step in the process, to continue with the cooking metaphor, is to bake the recipe. This is how our preprocessing steps are applied to the data, giving us the train_baked and test_baked data sets used below.
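The baking step itself can be sketched as follows. This is a minimal, self-contained illustration, not the exact code behind this post: `toy_train` and `toy_test` are hypothetical stand-ins for train_tbl and test_tbl, and the steps are written without the pipe so that only the recipes package is needed.

```r
library(recipes)

# Hypothetical toy data standing in for train_tbl and test_tbl
toy_train <- data.frame(
  Churn    = c("No", "Yes", "No", "Yes", "No", "No"),
  Contract = c("One year", "Month-to-month", "Two year",
               "Month-to-month", "One year", "Two year"),
  tenure   = c(24, 2, 60, 5, 30, 48),
  stringsAsFactors = FALSE
)
toy_test <- data.frame(
  Churn    = c("Yes", "No"),
  Contract = c("Month-to-month", "Two year"),
  tenure   = c(3, 55),
  stringsAsFactors = FALSE
)

# Define the recipe, then prep it on the training data only
toy_recipe  <- recipe(Churn ~ ., data = toy_train)
toy_recipe  <- step_string2factor(toy_recipe, all_nominal(), -all_outcomes())
toy_prepped <- prep(toy_recipe, training = toy_train)

# bake() applies the prepped steps to any data set with the same columns
train_baked <- bake(toy_prepped, new_data = toy_train)
test_baked  <- bake(toy_prepped, new_data = toy_test)

# The character predictor is now a factor with the same levels in both sets
is.factor(train_baked$Contract)
identical(levels(train_baked$Contract), levels(test_baked$Contract))
```

Prepping on the training set and baking both sets with the same prepped recipe keeps the test data from influencing the preprocessing.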
tidymodels leans on the parsnip package for model building. parsnip offers a unified API that gives access to a variety of modeling packages without requiring us to learn the syntax of each one. It takes only three simple steps to fit a model: specify the model type and mode, set the engine, and fit the model to the data.
logistic_glm <-
  logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit(Churn ~ MonthlyCharges + tenure + gender, data = train_baked)
If we wanted to switch to a different engine, all we would have to do is change the set_engine argument to the desired tool, and parsnip handles all the dirty work behind the scenes. I said unified API!
There’s a long list of engines we can use.
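One way to see that list, assuming a reasonably recent version of parsnip that provides show_engines(), is sketched below. Nothing is fitted here; the snippet only lists the engines and builds a model specification.

```r
library(parsnip)

# List the engines parsnip has registered for logistic regression
engines <- show_engines("logistic_reg")
engines

# The engine is just one field of the model specification, so swapping it
# means changing the string passed to set_engine() and nothing else
spec_glm <- set_engine(logistic_reg(mode = "classification"), "glm")
spec_glm$engine
```

Any other engine name from the listing could be substituted for "glm", although fitting with it may require installing the package behind that engine.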
After building our model, it’s time to evaluate the results. This is where the yardstick package comes in. yardstick provides a simple way to calculate several popular assessment measures. But before we can do that we’ll need some predictions. We get our predictions by passing the test_baked data to the predict function.
predictions_glm <- logistic_glm %>%
  predict(new_data = test_baked) %>%
  bind_cols(test_baked %>% select(Churn))
Now with our freshly minted predictions, there are several metrics we can use to evaluate the performance of our classification model. For the sake of simplicity we will focus on some of the metrics I introduced in Blog Post 3: Accuracy, Precision, Recall and F1 Score. You may remember that all these measures are derived from the Confusion Matrix, a table that describes the performance of a classification model.
Refer back to Blog Post 3 if you need a refresher on the Confusion Matrix.
Now, with just a few more lines of code, we have a Confusion Matrix that can be leveraged to calculate additional metrics.
predictions_glm %>%
  conf_mat(Churn, .pred_class) %>%
  pluck(1) %>%
  as_tibble() %>%
  ggplot(aes(Prediction, Truth, alpha = n)) +
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = n), colour = "white", alpha = 1, size = 8)
The model’s Accuracy is the fraction of predictions the model got right, and it can be easily calculated by passing predictions_glm to the metrics function. However, accuracy is not a very reliable metric, as it gives misleading results if the data set is unbalanced.
predictions_glm %>%
  metrics(Churn, .pred_class) %>%
  select(-.estimator) %>%
  filter(.metric == "accuracy")
## # A tibble: 1 x 2
##   .metric  .estimate
##   <chr>        <dbl>
## 1 accuracy     0.786
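To see why accuracy can mislead on unbalanced data, here is a small base R sketch. The outcome vector is made up for illustration (it is not the churn data), and the "model" simply predicts the majority class for everyone.

```r
# Hypothetical, deliberately unbalanced outcome: 90 stayers, 10 churners
truth <- factor(c(rep("No", 90), rep("Yes", 10)), levels = c("No", "Yes"))

# A "model" that ignores its inputs and always predicts the majority class
pred <- factor(rep("No", 100), levels = c("No", "Yes"))

# Accuracy looks respectable ...
accuracy <- mean(pred == truth)

# ... yet not a single churner is identified
churners_found <- sum(pred == "Yes" & truth == "Yes")

accuracy        # 0.9
churners_found  # 0
```

A model like this is useless for retention work despite its 90% accuracy, which is why we look at other metrics too.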
Precision shows how sensitive a model is to False Positives (i.e. predicting a customer is leaving when they are actually staying), whereas Recall shows how sensitive it is to False Negatives (i.e. predicting a customer is staying when they are actually leaving).
These are very relevant business metrics because organisations are particularly interested in accurately predicting which customers are truly at risk of leaving, so that they can target them with retention strategies. At the same time, they want to minimise the effort spent retaining customers who were incorrectly classified as leaving but are in fact staying.
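The two definitions can be made concrete with a few lines of base R. The counts below are hypothetical, chosen only to illustrate the arithmetic, and they treat "customer leaves" as the positive class.

```r
# Hypothetical confusion-matrix counts (made up for illustration),
# with "customer leaves" as the positive class
tp <- 40  # predicted leaving, actually leaving
fp <- 10  # predicted leaving, actually staying (False Positive)
fn <- 20  # predicted staying, actually leaving (False Negative)

precision_val <- tp / (tp + fp)  # share of predicted leavers who really leave
recall_val    <- tp / (tp + fn)  # share of real leavers the model catches

precision_val  # 0.8
recall_val     # 0.6666667
```

One thing worth checking when reading yardstick output: by default it treats the first level of the outcome factor as the event of interest, so the reported precision and recall refer to that class.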
tibble(
  "precision" =
    precision(predictions_glm, Churn, .pred_class) %>%
    select(.estimate),
  "recall" =
    recall(predictions_glm, Churn, .pred_class) %>%
    select(.estimate)
) %>%
  unnest(cols = c(precision, recall)) %>%
  kable()
| precision | recall |
|---|---|
| 0.8252853 | 0.9012464 |
The F1 Score is the harmonic mean of precision and recall. It reaches its best value at 1, with perfect precision and recall. yardstick reports it as f_meas.
| .metric | .estimate |
|---|---|
| f_meas | 0.8615949 |
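As a quick sanity check, the harmonic mean can be computed by hand in base R from the precision and recall values reported earlier. This is only a sketch of the formula, not how yardstick computes it internally.

```r
# Precision and recall as reported in the precision/recall table
precision_val <- 0.8252853
recall_val    <- 0.9012464

# Harmonic mean of the two
f1 <- 2 * precision_val * recall_val / (precision_val + recall_val)
f1
```

The result should agree with the f_meas estimate to about six decimal places; any small difference comes from the rounding of the two inputs.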
That’s a quick introduction to tidymodels. If you are interested in a more detailed explanation, you might want to check out the talk at the recent rstudioConference.
Thanks for reading.