We consider the data set of consumer complaints submitted to the United States Consumer Financial Protection Bureau (CFPB).
# packages
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.5 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## -- Attaching packages -------------------------------------- tidymodels 0.1.4 --
## v broom 0.7.9 v rsample 0.1.0
## v dials 0.0.10 v tune 0.1.6
## v infer 1.0.0 v workflows 0.2.4
## v modeldata 0.1.1 v workflowsets 0.1.0
## v parsnip 0.1.7 v yardstick 0.0.8
## v recipes 0.1.17
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(textrecipes)
# install.packages("discrim")
library(discrim)
##
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
##
## smoothness
library(tictoc)
# install.packages("naivebayes")
# install.packages("glmnet")
library(hardhat)
library(themis)
## Registered S3 methods overwritten by 'themis':
## method from
## bake.step_downsample recipes
## bake.step_upsample recipes
## prep.step_downsample recipes
## prep.step_upsample recipes
## tidy.step_downsample recipes
## tidy.step_upsample recipes
## tunable.step_downsample recipes
## tunable.step_upsample recipes
##
## Attaching package: 'themis'
## The following objects are masked from 'package:recipes':
##
## step_downsample, step_upsample
complaints <- read_csv("Datasets/complaints.csv.gz",
show_col_types = FALSE)
glimpse(complaints)
## Rows: 117,214
## Columns: 18
## $ date_received <date> 2019-09-24, 2019-10-25, 2019-11-08, 2019~
## $ product <chr> "Debt collection", "Credit reporting, cre~
## $ sub_product <chr> "I do not know", "Credit reporting", "I d~
## $ issue <chr> "Attempts to collect debt not owed", "Inc~
## $ sub_issue <chr> "Debt is not yours", "Information belongs~
## $ consumer_complaint_narrative <chr> "transworld systems inc. \nis trying to c~
## $ company_public_response <chr> NA, "Company has responded to the consume~
## $ company <chr> "TRANSWORLD SYSTEMS INC", "TRANSUNION INT~
## $ state <chr> "FL", "CA", "NC", "RI", "FL", "TX", "SC",~
## $ zip_code <chr> "335XX", "937XX", "275XX", "029XX", "333X~
## $ tags <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ consumer_consent_provided <chr> "Consent provided", "Consent provided", "~
## $ submitted_via <chr> "Web", "Web", "Web", "Web", "Web", "Web",~
## $ date_sent_to_company <date> 2019-09-24, 2019-10-25, 2019-11-08, 2019~
## $ company_response_to_consumer <chr> "Closed with explanation", "Closed with e~
## $ timely_response <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",~
## $ consumer_disputed <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A",~
## $ complaint_id <dbl> 3384392, 3417821, 3433198, 3366475, 33853~
summary(complaints)
## date_received product sub_product issue
## Min. :2019-01-01 Length:117214 Length:117214 Length:117214
## 1st Qu.:2019-04-02 Class :character Class :character Class :character
## Median :2019-06-27 Mode :character Mode :character Mode :character
## Mean :2019-06-25
## 3rd Qu.:2019-09-16
## Max. :2020-01-09
## sub_issue consumer_complaint_narrative company_public_response
## Length:117214 Length:117214 Length:117214
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## company state zip_code tags
## Length:117214 Length:117214 Length:117214 Length:117214
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## consumer_consent_provided submitted_via date_sent_to_company
## Length:117214 Length:117214 Min. :2019-01-01
## Class :character Class :character 1st Qu.:2019-04-03
## Mode :character Mode :character Median :2019-06-27
## Mean :2019-06-25
## 3rd Qu.:2019-09-16
## Max. :2020-01-09
## company_response_to_consumer timely_response consumer_disputed
## Length:117214 Length:117214 Length:117214
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## complaint_id
## Min. :3113204
## 1st Qu.:3199571
## Median :3288781
## Mean :3288551
## 3rd Qu.:3375441
## Max. :3491200
For our first model, let’s build a binary classification model to predict whether a submitted complaint is about "Credit reporting, credit repair services, or other personal consumer reports" or about something else. First, let’s look at the possible values of product:
unique(complaints$product)
## [1] "Debt collection"
## [2] "Credit reporting, credit repair services, or other personal consumer reports"
## [3] "Money transfer, virtual currency, or money service"
## [4] "Mortgage"
## [5] "Student loan"
## [6] "Credit card or prepaid card"
## [7] "Checking or savings account"
## [8] "Payday loan, title loan, or personal loan"
## [9] "Vehicle loan or lease"
This kind of “yes or no” binary classification model is both common and useful in real-world text machine learning problems.
It is always a good idea to look at your data! Here are the first six complaints:
head(complaints$consumer_complaint_narrative)
## [1] "transworld systems inc. \nis trying to collect a debt that is not mine, not owed and is inaccurate."
## [2] "I would like to request the suppression of the following items from my credit report, which are the result of my falling victim to identity theft. This information does not relate to [ transactions that I have made/accounts that I have opened ], as the attached supporting documentation can attest. As such, it should be blocked from appearing on my credit report pursuant to section 605B of the Fair Credit Reporting Act."
## [3] "Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work."
## [4] "I was sold access to an event digitally, of which I have all the screenshots to detail the transactions, transferred the money and was provided with only a fake of a ticket. I have reported this to paypal and it was for the amount of {$21.00} including a {$1.00} fee from paypal. \n\nThis occured on XX/XX/2019, by paypal user who gave two accounts : 1 ) XXXX 2 ) XXXX XXXX"
## [5] "While checking my credit report I noticed three collections by a company called ARS that i was unfamiliar with. I disputed these collections with XXXX, and XXXX and they both replied that they contacted the creditor and the creditor verified the debt so I asked for proof which both bureaus replied that they are not required to prove anything. I then mailed a certified letter to ARS requesting proof of the debts n the form of an original aggrement, or a proof of a right to the debt, or even so much as the process as to how the bill was calculated, to which I was simply replied a letter for each collection claim that listed my name an account number and an amount with no other information to verify the debts after I sent a clear notice to provide me evidence. Afterwards I recontacted both XXXX, and XXXX, to redispute on the premise that it is not my debt if evidence can not be drawn up, I feel as if I am being personally victimized by ARS on my credit report for debts that are not owed to them or any party for that matter, and I feel discouraged that the credit bureaus who control many aspects of my personal finances are so negligent about my information."
## [6] "I would like the credit bureau to correct my XXXX XXXX XXXX XXXX balance. My correct balance is XXXX"
The complaint narratives contain many series of capital “X”’s.
These strings (like “XX/XX” or “XXXX XXXX XXXX XXXX”) are used to protect personally identifiable information (PII) in this publicly available data set.
This is not a universal censoring mechanism; censoring and PII protection will vary from source to source.
Hopefully you will be able to find information on PII censoring in a data dictionary, but you should always look at the data yourself to verify.
We also see that monetary amounts are surrounded by curly brackets (like "{$21.00}"); this is another text preprocessing step that has been taken care of for us.
ccn <- complaints$consumer_complaint_narrative %>%
str_extract_all(pattern = "\\{\\$[0-9\\.]*\\}") %>%
compact() # discards elements that are NULL or that have length zero
head(ccn,3)
## [[1]]
## [1] "{$21.00}" "{$1.00}"
##
## [[2]]
## [1] "{$2300.00}"
##
## [[3]]
## [1] "{$200.00}" "{$5000.00}" "{$5000.00}" "{$770.00}" "{$800.00}"
## [6] "{$5000.00}"
This data set includes more possible predictors than the text alone, but for this first model we will only use the text variable consumer_complaint_narrative.
Let’s create a factor outcome variable product with two levels,
“Credit” and
“Other.”
Then, we split the data into training and testing data sets.
We can use the initial_split() function from rsample to create this binary split of the data.
The strata argument ensures that the distribution of product is similar in the training set and testing set.
Since the split uses random sampling, we set a seed so we can reproduce our results.
Training and Testing Data Sets
# package 02
# library(tidymodels)
########
set.seed(1234)
complaints2class <- complaints %>%
mutate(product = factor(
if_else(
product == paste("Credit reporting, credit repair services,",
"or other personal consumer reports"),
"Credit",
"Other"
)))
names(complaints2class)
## [1] "date_received" "product"
## [3] "sub_product" "issue"
## [5] "sub_issue" "consumer_complaint_narrative"
## [7] "company_public_response" "company"
## [9] "state" "zip_code"
## [11] "tags" "consumer_consent_provided"
## [13] "submitted_via" "date_sent_to_company"
## [15] "company_response_to_consumer" "timely_response"
## [17] "consumer_disputed" "complaint_id"
unique(complaints2class$product)
## [1] Other Credit
## Levels: Credit Other
complaints_split <- initial_split(complaints2class, strata = product)
dim(complaints_split)
## analysis assessment n p
## 87910 29304 117214 18
complaints_train <- training(complaints_split)
complaints_test <- testing(complaints_split)
scales::percent(nrow(complaints_train)/nrow(complaints_split$data))
## [1] "75%"
scales::percent(nrow(complaints_test)/nrow(complaints_split$data))
## [1] "25%"
## package 03
## library(textrecipes)
complaints_rec <-
recipe(product ~ consumer_complaint_narrative,
data = complaints_train) %>%
step_tokenize(consumer_complaint_narrative) %>%
step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
step_tfidf(consumer_complaint_narrative)
names(complaints_rec)
## [1] "var_info" "term_info" "steps" "template" "levels" "retained"
Let’s create a workflow() to bundle together our modeling components.
complaint_wf <- workflow() %>%
add_recipe(complaints_rec)
complaint_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: None
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
Let’s start with a naive Bayes model (S. Kim et al. 2006; Kibriya et al. 2005; Frank and Bouckaert 2006), which is available in the tidymodels package discrim.
In tidymodels, the package for creating model specifications is parsnip (Kuhn and Vaughan 2021b). The parsnip package provides the functions for creating all the models we have used so far, but other extension packages provide more. The discrim package is an extension package for parsnip that contains model definitions for various discriminant analysis models, including naive Bayes.
Model Specifications
## package05
## library(discrim)
nb_spec <- naive_Bayes() %>%
set_mode("classification") %>%
set_engine("naivebayes")
nb_spec
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes
Now we have everything we need to fit our first classification model.
Let’s add the naive Bayes model to our workflow and fit it to the training data.
tic("process time") ## process time about 1 min
nb_fit <- complaint_wf %>%
add_model(nb_spec) %>%
fit(data = complaints_train)
toc()
## tic("process time") # process time about 40 mins
## saveRDS(nb_fit, "Output/regression_nb_fit.rds")
## toc()
tic("process time") # process time 7 secs
nb_fit <- readRDS("Output/classification_nb_fit.rds")
toc()
We have trained our first classification model!
We will use resampling methods to evaluate our model.
Let’s use resampling to estimate the performance of the naive Bayes classification model we just fit.
We can do this using resampled data sets built from the training set.
Let’s create 10-fold cross-validation sets, and use these resampled sets for performance estimates.
set.seed(234)
complaints_folds <- vfold_cv(complaints_train)
glimpse(complaints_folds)
## Rows: 10
## Columns: 2
## $ splits <list> [<vfold_split[79119 x 8791 x 87910 x 18]>], [<vfold_split[7911~
## $ id <chr> "Fold01", "Fold02", "Fold03", "Fold04", "Fold05", "Fold06", "Fo~
Each of these splits contains information about how to create cross-validation folds from the original training data.
In this example, 90% of the training data is included in each fold, and
the other 10% is held out for evaluation.
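We can peek at a single fold to confirm these sizes; analysis() and assessment() from rsample return the two parts of a split.
# Check the analysis/assessment sizes for the first resample.
first_fold <- complaints_folds$splits[[1]]
dim(analysis(first_fold))
dim(assessment(first_fold))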
For convenience, let’s again use a workflow() for our resampling estimates of performance.
Using a workflow() isn’t required (you can fit or tune a model together with a preprocessor directly), but it can make your code easier to read and organize.
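For reference, the equivalent call without a workflow() passes the model specification and recipe directly to fit_resamples(); the sketch below is commented out since it would refit the model.
# Equivalent without a workflow() (not run):
# fit_resamples(nb_spec, preprocessor = complaints_rec,
#               resamples = complaints_folds)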
nb_wf <- workflow() %>%
add_recipe(complaints_rec) %>%
add_model(nb_spec)
nb_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: naive_Bayes()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes
Now, to estimate how well that model performs, let’s fit the model many times, once to each of these resampled folds, and then evaluate on the heldout part of each resampled fold.
tic("process time") # process time about 10 mins
nb_rs <- fit_resamples(
nb_wf,
complaints_folds,
control = control_resamples(save_pred = TRUE)
)
toc()
## tic("process time") ## process time about 1 min
## saveRDS(nb_rs, "Output/regression_nb_rs.rds")
## toc()
tic("process time") ## process time about 1 min
nb_rs <- readRDS("Output/classification_nb_rs.rds")
toc()
## process time: 11.19 sec elapsed
nb_rs_metrics <- collect_metrics(nb_rs)
nb_rs_predictions <- collect_predictions(nb_rs)
nb_rs_metrics
The default performance metrics for binary classification are accuracy and ROC AUC (area under the receiver operating characteristic curve).
For these resamples, the average accuracy is 80.2%.
Accuracy and ROC AUC are performance metrics used for classification models. For both, values closer to 1 are better.
Accuracy is the proportion of the data that is predicted correctly. Be aware that accuracy can be misleading in some situations, such as for imbalanced data sets.
ROC AUC measures how well a classifier performs at different thresholds. The ROC curve plots the true positive rate against the false positive rate; AUC closer to 1 indicates a better-performing model, while AUC closer to 0.5 indicates a model that does no better than random guessing.
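As a quick check, the same per-fold metrics can be computed directly from the collected predictions with yardstick; these are the values that collect_metrics() averages.
# Per-fold accuracy and ROC AUC from the resampled predictions.
nb_rs_predictions %>%
  group_by(id) %>%
  accuracy(truth = product, estimate = .pred_class)
nb_rs_predictions %>%
  group_by(id) %>%
  roc_auc(truth = product, .pred_Credit)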
nb_rs_predictions %>%
group_by(id) %>%
roc_curve(truth = product, .pred_Credit) %>%
autoplot() +
labs(
color = NULL,
title = "ROC curve for US Consumer Finance Complaints",
subtitle = "Each resample fold is shown in a different color"
)
Sensitivity and specificity mathematically describe the accuracy of a test which reports the presence or absence of a condition, in comparison to a ‘Gold Standard’ or definition.
Sensitivity (True Positive Rate) refers to the proportion of those who received a positive result on this test out of those who actually have the condition (when judged by the ‘Gold Standard’).
Specificity (True Negative Rate) refers to the proportion of those who received a negative result on this test out of those who do not actually have the condition (when judged by the ‘Gold Standard’).
The area under each of these curves is the roc_auc metric we have computed.
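We can compute these two metrics from the resampled predictions as well; here “Credit”, the first factor level, is treated as the event.
# Sensitivity and specificity for the naive Bayes resample predictions.
nb_rs_predictions %>%
  sens(truth = product, estimate = .pred_class)
nb_rs_predictions %>%
  spec(truth = product, estimate = .pred_class)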
Another way to evaluate our model is to examine the confusion matrix.
The function conf_mat_resampled() computes a separate confusion matrix for each resample and then averages the cell counts.
conf_mat_resampled(nb_rs, tidy = FALSE)
## Credit Other
## Credit 2864.7 438.2
## Other 1302.1 4186.0
conf_mat_resampled(nb_rs, tidy = FALSE) %>%
autoplot(type = "heatmap")
In the figure the squares for “Credit”/“Credit” and “Other”/“Other” have a darker shade than the off-diagonal squares.
One metric alone cannot give you a complete picture of how well your classification model is performing. The confusion matrix is a good starting point to get an overview of your model performance, as it includes rich information.
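Because conf_mat_resampled(nb_rs, tidy = FALSE) returns a conf_mat object, summary() can derive a battery of related metrics (sensitivity, specificity, precision, and so on) from the averaged cell counts; a quick sketch:
# Derive additional metrics from the averaged confusion matrix.
conf_mat_resampled(nb_rs, tidy = FALSE) %>%
  summary()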
We can build a classification null_model() specification and add it to a workflow() with the same preprocessing recipe we used in the previous section, to estimate performance.
null_classification <- null_model() %>%
set_engine("parsnip") %>%
set_mode("classification")
tic("process time") ## process time is about 9 minutes
null_rs <- workflow() %>%
add_recipe(complaints_rec) %>%
add_model(null_classification) %>%
fit_resamples(
complaints_folds
)
toc()
## tic("process time") ## process time is about 1 min
## saveRDS(null_rs, "Output/regression_null_rs.rds")
## toc()
tic("process time") ## process time is about 9 min
null_rs <- readRDS("Output/classification_null_rs.rds")
toc()
## process time: 11.06 sec elapsed
null_rs %>%
collect_metrics()
The accuracy and ROC AUC indicate that this null model is dramatically worse than even our first model.
In other words, the text of the CFPB complaints is predictive of the category we are building models for.
Regularized linear models are a class of statistical model that can be used in regression and classification tasks.
Linear models are not considered cutting edge in NLP research, but are a workhorse in real-world practice.
Here we will use a lasso regularized model (Tibshirani 1996), where the regularization method also performs variable selection.
LASSO stands for Least Absolute Shrinkage and Selection Operator.
It was introduced to improve the prediction accuracy and interpretability of regression models.
The method originally appeared in the geophysics literature and was later independently developed and popularized by Robert Tibshirani, who coined the term.
In text analysis, we typically have many tokens, which are the features in our machine learning problem.
Using regularization helps us choose a simpler model that we expect to generalize better to new observations, and variable selection helps us identify which features to include in our model.
Lasso regression or classification learns how much of a penalty to put on some features (sometimes penalizing all the way down to zero) so that we can select only some features out of the high-dimensional space of original possible variables (tokens) for the final model.
Let’s create a specification of a lasso regularized model. Remember that in tidymodels, specifying a model has three components: the algorithm, the mode, and the computational engine.
Video: Robust, Interpretable Statistical Models: Sparse Regression with the LASSO
lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>% # lasso : `mixture = 1`
set_mode("classification") %>%
set_engine("glmnet")
lasso_spec
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.01
## mixture = 1
##
## Computational engine: glmnet
Then we can create another workflow() object with the lasso specification. Notice that we can reuse our text preprocessing recipe.
lasso_wf <- workflow() %>%
add_recipe(complaints_rec) %>%
add_model(lasso_spec)
lasso_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.01
## mixture = 1
##
## Computational engine: glmnet
Now we estimate the performance of this first lasso classification model with fit_resamples().
tic("process_time") # process time is about 22 minutes
set.seed(2020)
lasso_rs <- fit_resamples(
lasso_wf,
complaints_folds,
control = control_resamples(save_pred = TRUE)
)
toc()
# tic("process time") ## process time is about 1 min
# saveRDS(lasso_rs, "Output/classification_lasso_rs.rds")
# toc()
tic("process time") ## process time is about 9 sec
lasso_rs <- readRDS("Output/classification_lasso_rs.rds")
toc()
## process time: 10.69 sec elapsed
We collect the results with collect_metrics() and collect_predictions().
lasso_rs_metrics <- collect_metrics(lasso_rs)
lasso_rs_predictions <- collect_predictions(lasso_rs)
lasso_rs_metrics
This looks pretty promising, considering we haven’t yet done any tuning of the lasso hyperparameters.
The figure below shows the ROC curves for this regularized model on each of the resampled data sets.
lasso_rs_predictions %>%
group_by(id) %>%
roc_curve(truth = product, .pred_Credit) %>%
autoplot() +
labs(
color = NULL,
title = "ROC curve for US Consumer Finance Complaints",
subtitle = "Each resample fold is shown in a different color"
)
conf_mat_resampled(lasso_rs, tidy = FALSE)
## Credit Other
## Credit 3421.6 401.8
## Other 745.2 4222.4
conf_mat_resampled(lasso_rs, tidy = FALSE) %>%
autoplot(type = "heatmap")
The value penalty = 0.01 for regularization in the previous model was picked somewhat arbitrarily.
How do we know the right or best regularization parameter penalty?
This is a model hyperparameter, and we cannot learn its best value during model training, but we can estimate the best value by training many models on resampled data sets and exploring how well all these models perform.
Let’s build a new model specification for model tuning.
tune_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
set_mode("classification") %>%
set_engine("glmnet")
tune_spec
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
Think of tune() here as a placeholder for the regularization penalty.
We can create a regular grid of values to try, using a convenience function for penalty().
lambda_grid <- grid_regular(penalty(), levels = 30)
lambda_grid
The function grid_regular() is from the dials package.
It chooses sensible values to try for a parameter like the regularization penalty; here, we asked for 30 different possible values.
Let’s use tune_grid() to fit a model at each of the values for the regularization penalty in our regular grid.
In tidymodels, the package for tuning is called tune. Tuning a model uses a similar syntax to fitting a model to a set of resampled data sets for the purposes of evaluation (fit_resamples()) because the two tasks are so similar. The difference is that when you tune, each model that you fit has different parameters and you want to find the best one.
Resampling with Tuning
Let’s add tune_spec to a workflow with the same preprocessing recipe we’ve been using so far, and then fit it to every possible parameter in lambda_grid and every resample in complaints_folds with tune_grid().
tune_wf <- workflow() %>%
add_recipe(complaints_rec) %>%
add_model(tune_spec)
tic("process time") # about 22 min
set.seed(2020)
tune_rs <- tune_grid(
tune_wf,
complaints_folds,
grid = lambda_grid,
control = control_resamples(save_pred = TRUE)
)
toc()
# tic("process time") # about 1 min
# saveRDS(tune_rs, "Output/classification_tune_rs.rds")
# toc()
tic("process time") # about 10 secs
tune_rs <- readRDS("Output/classification_tune_rs.rds")
toc()
## process time: 11.4 sec elapsed
glimpse(tune_rs)
## Rows: 10
## Columns: 5
## $ splits <list> [<vfold_split[79119 x 8791 x 87910 x 18]>], [<vfold_spli~
## $ id <chr> "Fold01", "Fold02", "Fold03", "Fold04", "Fold05", "Fold06~
## $ .metrics <list> [<tbl_df[60 x 5]>], [<tbl_df[60 x 5]>], [<tbl_df[60 x 5]~
## $ .notes <list> [<tbl_df[0 x 1]>], [<tbl_df[0 x 1]>], [<tbl_df[0 x 1]>],~
## $ .predictions <list> [<tbl_df[263730 x 7]>], [<tbl_df[263730 x 7]>], [<tbl_df~
collect_metrics(tune_rs)
autoplot(tune_rs) +
labs(
title = "Lasso model performance across regularization penalties",
subtitle = "Performance metrics can be used to identity the best penalty"
)
We can view the best results with show_best() and a choice for the metric, such as ROC AUC.
tune_rs %>%
show_best("roc_auc") %>%
mutate(mean = round(mean, 4))
The best value for ROC AUC from this tuning run is 0.953.
We can extract the best regularization parameter for this value of ROC AUC from our tuning results with select_best(), or choose a simpler model with more regularization using select_by_pct_loss() or select_by_one_std_err().
Let’s choose the model with the best ROC AUC within one standard error of the numerically best model (Breiman et al. 1984).
chosen_auc <- tune_rs %>%
select_by_one_std_err(metric = "roc_auc", -penalty) %>%
mutate(mean = round(mean, 4))
chosen_auc
tune_rs %>%
select_best(metric = "roc_auc")
tune_rs %>%
select_best(metric = "accuracy")
This is the regularization penalty that our tuning results indicate gives us the best model. We can now finalize our tunable workflow with this value.
final_lasso <- finalize_workflow(tune_wf, chosen_auc)
final_lasso
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.000788046281566992
## mixture = 1
##
## Computational engine: glmnet
Instead of penalty = tune() like before, now our workflow has finalized values for all arguments.
The preprocessing recipe has been evaluated on the training data, and we tuned the regularization penalty so that we have a penalty value of 0.00079.
This workflow is ready to go! It can now be fit to our training data.
tic("process time") # process time is about 2 min
fitted_lasso <- fit(final_lasso, complaints_train)
toc()
# tic("process time") # About 34 secs
# saveRDS(fitted_lasso, "Output/classification_fitted_lasso.rds")
# toc()
tic("process time") # about 5 sec
fitted_lasso <- readRDS("Output/classification_fitted_lasso.rds")
toc()
## process time: 7.44 sec elapsed
fitted_lasso.tidy <-
fitted_lasso %>%
pull_workflow_fit() %>%
tidy() %>%
mutate(estimate = round(estimate,1)) %>%
select(-penalty)
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-2
#####
head(fitted_lasso.tidy)
The table below shows, for the penalty of 0.0007880463, the terms that contribute the most to a complaint not being about credit.
The words are largely about mortgages and other financial products.
# not about credit
fitted_lasso.tidy %>%
filter(estimate > 10) %>%
arrange(-estimate) %>%
separate(term, c(NA,NA,NA,NA, "term")) %>%
knitr::kable()
| term | estimate |
|---|---|
| funds | 27.6 |
| appraisal | 22.9 |
| escrow | 21.0 |
| bonus | 20.7 |
| debt | 18.5 |
| emailed | 16.4 |
| money | 16.1 |
| interest | 15.7 |
| afford | 15.5 |
| fees | 14.9 |
| merchant | 14.9 |
| branch | 14.6 |
| cash | 14.5 |
| collect | 14.4 |
| servicer | 13.8 |
| fee | 13.4 |
| transfer | 13.0 |
| refusing | 12.6 |
| purchases | 12.3 |
| affidavit | 12.2 |
| fdcpa | 11.9 |
| refund | 11.8 |
| thought | 11.3 |
| paypal | 11.2 |
| processing | 11.2 |
| bank | 10.9 |
| calling | 10.8 |
| title | 10.8 |
| purchased | 10.7 |
| closing | 10.5 |
| at | 10.3 |
| debit | 10.3 |
| customers | 10.2 |
| deposited | 10.2 |
The terms with large negative coefficients contribute the most to a complaint being about credit reporting.
fitted_lasso.tidy %>%
filter(estimate < -10) %>%
arrange(estimate) %>%
separate(term, c(NA,NA,NA,NA, "term")) %>%
knitr::kable()
| term | estimate |
|---|---|
| reseller | -90.9 |
| experian | -56.9 |
| transunion | -50.1 |
| equifax | -48.1 |
| compliant | -23.7 |
| reporting | -21.1 |
| freeze | -20.9 |
| inquiries | -19.0 |
| report | -18.6 |
| method | -16.3 |
| inquiry | -15.0 |
| physically | -15.0 |
| delinquency | -14.5 |
| investigation | -14.1 |
| late | -14.0 |
| accuracy | -13.9 |
| delinquent | -13.9 |
| score | -13.2 |
| scores | -12.5 |
| credit | -12.2 |
| run | -12.2 |
| investigate | -12.1 |
| accounts | -12.0 |
| items | -11.9 |
| companies | -11.4 |
| dropped | -11.3 |
| correct | -11.1 |
| addresses | -11.0 |
| corrected | -10.8 |
| creditors | -10.5 |
| compliance | -10.4 |
| record | -10.3 |
| verifiable | -10.1 |
Since we are using a linear model, the model coefficients are directly interpretable and transparently give us variable importance. Many models useful for machine learning with text do not have such transparent variable importance; in those situations, you can use other model-independent or model-agnostic approaches like permutation variable importance.
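For this linear model, a simple sketch of such an importance ranking is just the magnitude of the coefficients we already tidied above.
# Rank terms by the absolute value of their coefficient.
fitted_lasso.tidy %>%
  filter(term != "(Intercept)") %>%
  slice_max(abs(estimate), n = 10)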
We can change how our text data is represented to take advantage of its sparsity, especially for models like lasso regularized models.
The regularized regression model we have been training in previous sections used set_engine("glmnet"); this computational engine can be more efficient when text data is transformed to a sparse matrix (Section 5.1), rather than a dense data frame or tibble representation.
To keep our text data sparse throughout modeling and use the sparse capabilities of set_engine("glmnet"), we need to explicitly set a non-default preprocessing blueprint, using the package hardhat (Vaughan and Kuhn 2020).
The hardhat package is used by other tidymodels packages like recipes and parsnip under the hood. As a tidymodels user, you typically don’t use hardhat functions directly. The exception is when you need to customize something about your model or preprocessing, like in this sparse data example.
# package06
# library(hardhat)
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")
This “blueprint” lets us specify during modeling how we want our data passed around from the preprocessing into the model.
The composition “dgCMatrix” is the most common sparse matrix type, from the Matrix package (Bates and Maechler 2021), used in R for modeling.
We can use this blueprint argument when we add our recipe to our modeling workflow, to define how the data should be passed into the model.
sparse_wf <- workflow() %>%
add_recipe(complaints_rec, blueprint = sparse_bp) %>%
add_model(tune_spec)
sparse_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
The last time we tuned a lasso model, we used the defaults for the penalty parameter and 30 levels.
Let’s restrict the values this time using the range argument, so that we don’t test regularization values that are as small as before, and try only 20 levels.
smaller_lambda <- grid_regular(penalty(range = c(-5, 0)), levels = 20)
smaller_lambda %>%
mutate(penalty = format(penalty, scientific=FALSE)) %>%
knitr::kable()
| penalty |
|---|
| 0.00001000000 |
| 0.00001832981 |
| 0.00003359818 |
| 0.00006158482 |
| 0.00011288379 |
| 0.00020691381 |
| 0.00037926902 |
| 0.00069519280 |
| 0.00127427499 |
| 0.00233572147 |
| 0.00428133240 |
| 0.00784759970 |
| 0.01438449888 |
| 0.02636650899 |
| 0.04832930239 |
| 0.08858667904 |
| 0.16237767392 |
| 0.29763514416 |
| 0.54555947812 |
| 1.00000000000 |
tic("process time") # about 12 minutes
set.seed(2020)
sparse_rs <- tune_grid(
sparse_wf,
complaints_folds,
grid = smaller_lambda
)
toc()
# tic("process time") # process time about 1 min and 15 secs
# saveRDS(sparse_rs, "Output/classification_sparse_rs.rds")
# toc()
tic("process time") # process time about 11 secs
sparse_rs <- readRDS("Output/classification_sparse_rs.rds")
toc()
## process time: 10.86 sec elapsed
glimpse(sparse_rs)
## Rows: 10
## Columns: 4
## $ splits <list> [<vfold_split[79119 x 8791 x 87910 x 18]>], [<vfold_split[79~
## $ id <chr> "Fold01", "Fold02", "Fold03", "Fold04", "Fold05", "Fold06", "~
## $ .metrics <list> [<tbl_df[40 x 5]>], [<tbl_df[40 x 5]>], [<tbl_df[40 x 5]>], ~
## $ .notes <list> [<tbl_df[0 x 1]>], [<tbl_df[0 x 1]>], [<tbl_df[0 x 1]>], [<t~
set_engine("glmnet")?sparse_rs %>%
show_best("roc_auc") %>%
mutate(penalty = round(penalty, 6),
mean = round(mean, digits=4),
std_err = round(std_err, digits = 6))
The best ROC AUC is nearly identical;
the best ROC AUC for the non-sparse tuned lasso model in Section 7.4 was 0.953.
The best regularization parameter (penalty) is a little different (the best value in Section 7.4 was 0.00079), but we used a different grid so didn’t try out exactly the same values.
We ended up with nearly the same performance and best tuned model.
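A quick side-by-side makes the comparison concrete; this sketch simply stacks the single best row from each tuning result.
# Compare the best ROC AUC from the dense (tune_rs) and sparse (sparse_rs) runs.
bind_rows(
  dense  = show_best(tune_rs, "roc_auc")[1, ],
  sparse = show_best(sparse_rs, "roc_auc")[1, ],
  .id = "representation"
)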
Importantly, this tuning also took a bit less time to complete.
The preprocessing was not much faster, because tokenization and computing tf-idf take a long time.
The model fitting was much faster, because this implementation of regularized regression handles sparse matrix input much more efficiently than dense input.
Overall, the whole tuning workflow is about 10% faster using the sparse preprocessing blueprint.
Depending on how computationally expensive your preprocessing is relative to your model and how sparse your data is, you may expect to see larger (or smaller) gains from moving to a sparse data representation.
Since our model performance is about the same and we see gains in training time, let’s use this sparse representation for the rest of this chapter.
Most of this chapter focuses on binary classification, where we have two classes in our outcome variable (such as “Credit” and “Other”) and each observation can either be one or the other.
This is a simple scenario with straightforward evaluation strategies because the results only have a two-by-two contingency matrix.
However, it is not always possible to limit a modeling question to two classes.
Let’s explore how to deal with situations where we have more than two classes.
The CFPB complaints data set in this chapter has nine different product classes.
In decreasing frequency, they are:
Credit reporting, credit repair services, or other personal consumer reports
Debt collection
Credit card or prepaid card
Mortgage
Checking or savings account
Student loan
Vehicle loan or lease
Money transfer, virtual currency, or money service
Payday loan, title loan, or personal loan
We assume that there is a reason why these product classes have been created in this fashion by this government agency.
Perhaps complaints from different classes are handled by different people or organizations.
Whatever the reason, in this section we would like to build a multiclass classifier to identify these nine specific product classes.
We need to create a new split of the data using initial_split() on the unmodified complaints data set.
set.seed(1234)
multicomplaints_split <- initial_split(complaints, strata = product)
dim(multicomplaints_split)
## analysis assessment n p
## 87910 29304 117214 18
multicomplaints_train <- training(multicomplaints_split)
multicomplaints_test <- testing(multicomplaints_split)
multicomplaints_train %>%
count(product, sort = TRUE) %>%
select(n, product) %>%
knitr::kable()
| n | product |
|---|---|
| 41678 | Credit reporting, credit repair services, or other personal consumer reports |
| 16688 | Debt collection |
| 8689 | Credit card or prepaid card |
| 7117 | Mortgage |
| 5219 | Checking or savings account |
| 2927 | Student loan |
| 2030 | Vehicle loan or lease |
| 1906 | Money transfer, virtual currency, or money service |
| 1656 | Payday loan, title loan, or personal loan |
When you have multiple classes in your data, it is possible to formulate the multiclass problem in two ways.
With one approach, any given observation can belong to multiple classes.
With the other approach, an observation can belong to one and only one class.
We will be sticking to the second, “one class per observation” model formulation in this section.
There are many different ways to deal with imbalanced data.
We will demonstrate one of the simplest methods, downsampling, where observations from the majority classes are removed during training to achieve a balanced class distribution.
We will be using the themis (Hvitfeldt 2020d) add-on package for recipes which provides the step_downsample() function to perform downsampling.
The themis package provides many more algorithms to deal with imbalanced data during data preprocessing.
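To see what downsampling does to the class counts before building the full recipe, here is a minimal sketch using only step_downsample() (no text preprocessing); each majority class is reduced to the size of the smallest class.
# Sketch: effect of step_downsample() on the training class counts.
recipe(product ~ consumer_complaint_narrative,
       data = multicomplaints_train) %>%
  step_downsample(product) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  count(product, sort = TRUE)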
We have to create a new recipe specification from scratch, since we are dealing with new training data this time.
The specification multicomplaints_rec is similar to what we created in Section 7.1.
The only changes are that different data is passed to the data argument in the recipe() function (it is now multicomplaints_train) and we have added step_downsample(product) to the end of the recipe specification to downsample after all the text preprocessing.
We want to downsample last so that we still generate features on the full training data set.
The downsampling will then only affect the modeling step, not the preprocessing steps, with hopefully better results.
# package07
# library(themis)
multicomplaints_rec <-
recipe(product ~ consumer_complaint_narrative,
data = multicomplaints_train) %>%
step_tokenize(consumer_complaint_narrative) %>%
step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
step_tfidf(consumer_complaint_narrative) %>%
step_downsample(product)
We also need a new cross-validation object since we are using a different data set.
multicomplaints_folds <- vfold_cv(multicomplaints_train)
We cannot reuse the tuneable lasso classification specification from Section 7.4 because it only works for binary classification.
Some model algorithms and computational engines (such as most random forests and SVMs) automatically detect when we perform multiclass classification from the number of classes in the outcome variable and do not require any changes to our model specification.
For lasso regularization, we need to create a new special model specification just for the multiclass case, using multinom_reg().
multi_spec <- multinom_reg(penalty = tune(), mixture = 1) %>%
set_mode("classification") %>%
set_engine("glmnet")
multi_spec
## Multinomial Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
We used the same arguments for penalty and mixture as in Section 7.4, as well as the same mode and engine, but this model specification is set up to handle more than just two classes.
We can combine this model specification with our preprocessing recipe for multiclass data in a workflow().
multi_lasso_wf <- workflow() %>%
add_recipe(multicomplaints_rec, blueprint = sparse_bp) %>%
add_model(multi_spec)
multi_lasso_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: multinom_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
## * step_downsample()
##
## -- Model -----------------------------------------------------------------------
## Multinomial Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
Now we have everything we need to tune the regularization penalty and find an appropriate value.
Note that we specify save_pred = TRUE, so we can create ROC curves and a confusion matrix later. This is especially beneficial for multiclass classification.
tic("process time") # About 1 hr
multi_lasso_rs <- tune_grid(
multi_lasso_wf,
multicomplaints_folds,
grid = smaller_lambda,
control = control_resamples(save_pred = TRUE)
)
toc()
# tic("process time") # About 1 min and 7 secs
# saveRDS(multi_lasso_rs, "Output/classification_multi_lasso_rs.rds")
# toc()
tic("process time") # About 10 secs
multi_lasso_rs <- readRDS("Output/classification_multi_lasso_rs.rds")
toc()
## process time: 13.02 sec elapsed
glimpse(multi_lasso_rs)
## Rows: 10
## Columns: 5
## $ splits <list> [<vfold_split[79119 x 8791 x 87910 x 18]>], [<vfold_spli~
## $ id <chr> "Fold01", "Fold02", "Fold03", "Fold04", "Fold05", "Fold06~
## $ .metrics <list> [<tbl_df[40 x 5]>], [<tbl_df[40 x 5]>], [<tbl_df[40 x 5]~
## $ .notes <list> [<tbl_df[1 x 1]>], [<tbl_df[0 x 1]>], [<tbl_df[1 x 1]>],~
## $ .predictions <list> [<tbl_df[175820 x 14]>], [<tbl_df[175820 x 14]>], [<tbl_~
best_acc <- multi_lasso_rs %>%
show_best("accuracy")
best_acc %>%
mutate(penalty = round(penalty,5),
mean = round(mean, 3),
std_err = round(std_err, 5))
In binary classification, there is one right answer and one wrong answer; in this multiclass case, there is one right answer and eight wrong answers.
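The yardstick metrics extend to the multiclass setting; for example, sens() macro-averages sensitivity across the nine classes by default. A minimal sketch using the resampled predictions at the best accuracy penalty:
# Macro-averaged sensitivity across the nine product classes.
multi_lasso_rs %>%
  collect_predictions() %>%
  filter(penalty == best_acc$penalty[1]) %>%
  sens(truth = product, estimate = .pred_class)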
tic("process time") About 20 seconds
conf_mat_ml_rs <-
multi_lasso_rs %>%
collect_predictions() %>%
filter(penalty == best_acc$penalty) %>%
filter(id == "Fold01") %>%
conf_mat(product, .pred_class)
toc()
# saveRDS(conf_mat_ml_rs, "Output/classification_conf_mat_ml_rs.rds")
conf_mat_ml_rs <- readRDS("Output/classification_conf_mat_ml_rs.rds")
png(filename = "Images/classification_conf_mat_ml_rs.png",
height = 800, width = 1500, res = 144)
conf_mat_ml_rs %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
dev.off()
## png
## 2
The diagonal is fairly well populated, which is a good sign.
This means that the model generally predicted the right class.
The off-diagonal numbers are all the failures and where we should direct our focus.
It is a little hard to see these cases well since the majority class affects the scale.
A trick to deal with this problem is to remove all the correctly predicted observations.
tic("process time")
conf_mat_ml_rs2 <-
multi_lasso_rs %>%
collect_predictions() %>%
filter(penalty == best_acc$penalty) %>%
filter(id == "Fold01") %>%
filter(.pred_class != product) %>%
conf_mat(product, .pred_class)
toc()
# saveRDS(conf_mat_ml_rs2, "Output/classification_conf_mat_ml_rs2.rds")
conf_mat_ml_rs2 <- readRDS("Output/classification_conf_mat_ml_rs2.rds")
png("Images/classification_conf_mat_ml_rs2.png",
height = 800, width = 1500, res = 144)
conf_mat_ml_rs2 %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
dev.off()
## png
## 2
Now we can more clearly see where our model breaks down in the figure above.
Some of the most common errors are “Credit reporting, credit repair services, or other personal consumer reports” complaints being wrongly predicted as “Debt collection” or “Credit card or prepaid card” complaints.
Those mistakes by the model are not hard to understand since all deal with credit and debt and do have overlap in vocabulary.
Knowing what the problem is helps us figure out how to improve our model.
The next step for improving our model is to revisit the data preprocessing steps and model selection.
We can look at different models or model engines that might be able to more easily separate the classes.
Now that we have an idea of where the model isn’t working, we can look more closely at the data and attempt to create features that could distinguish between these classes.
In Section 7.9 we will demonstrate how you can create your own custom features.
We are building a model from a data set that includes more than text data alone.
Annotations and labels have been added by the CFPB that we can use during modeling, but we need to ensure that only information that would be available at the time of prediction is included in the model.
Otherwise we will be very disappointed once our model is used to predict on new data!
The variables we identify as available for use as predictors are:
date_received
issue
sub_issue
consumer_complaint_narrative
company
state
zip_code
tags
submitted_via
Let’s try including date_received in our modeling, along with the text variable we have already used, consumer_complaint_narrative, and a new variable tags.
The submitted_via variable could have been a viable candidate, but all the entries are “Web.”
The other variables like ZIP code could be of use too, but they are categorical variables with many values so we will exclude them for now.
more_vars_rec <-
recipe(product ~ date_received + tags + consumer_complaint_narrative,
data = complaints_train)
We can use the step_date() function to extract the month and day of the week (“dow”) from date_received. Then the day of the week and month can be converted to indicator variables with step_dummy().
Categorical variables like the month can be stored as strings or factors, but for some kinds of models, they must be converted to indicator or dummy variables. These are numeric binary variables for the levels of the original categorical variable. For example, a variable called December would be created that is all zeroes and ones specifying which complaints were submitted in December, plus a variable called November, a variable called October, and so on.
glimpse(complaints_train)
## Rows: 87,910
## Columns: 18
## $ date_received <date> 2019-11-20, 2019-04-17, 2019-10-24, 2019~
## $ product <fct> Credit, Credit, Credit, Credit, Credit, C~
## $ sub_product <chr> "Credit reporting", "Credit reporting", "~
## $ issue <chr> "Incorrect information on your report", "~
## $ sub_issue <chr> "Account information incorrect", "Their i~
## $ consumer_complaint_narrative <chr> "I would like the credit bureau to correc~
## $ company_public_response <chr> "Company has responded to the consumer an~
## $ company <chr> "TRANSUNION INTERMEDIATE HOLDINGS, INC.",~
## $ state <chr> "TX", "NC", "NC", "CO", "WI", "CA", "FL",~
## $ zip_code <chr> "77004", "285XX", "276XX", "80920", NA, N~
## $ tags <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ consumer_consent_provided <chr> "Consent provided", "Consent provided", "~
## $ submitted_via <chr> "Web", "Web", "Web", "Web", "Web", "Web",~
## $ date_sent_to_company <date> 2019-11-20, 2019-04-17, 2019-10-24, 2019~
## $ company_response_to_consumer <chr> "Closed with explanation", "Closed with e~
## $ timely_response <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",~
## $ consumer_disputed <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A",~
## $ complaint_id <dbl> 3444592, 3214857, 3417374, 3420108, 33801~
more_vars_rec <- more_vars_rec %>%
step_date(date_received, features = c("month", "dow"), role = "dates") %>%
step_rm(date_received) %>%
step_dummy(has_role("dates"))
The tags variable has some missing data.
We can deal with this by using step_unknown(), which adds a new level to this factor variable for cases of missing data.
Then we “dummify” (create dummy/indicator variables) the variable with step_dummy().
sum(is.na(complaints_train$tags))
## [1] 73192
more_vars_rec <- more_vars_rec %>%
step_unknown(tags) %>%
step_dummy(tags)
more_vars_rec <- more_vars_rec %>%
step_tokenize(consumer_complaint_narrative) %>%
step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
step_tfidf(consumer_complaint_narrative)
more_vars_wf <- workflow() %>%
add_recipe(more_vars_rec, blueprint = sparse_bp) %>%
add_model(tune_spec)
more_vars_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 8 Recipe Steps
##
## * step_date()
## * step_rm()
## * step_dummy()
## * step_unknown()
## * step_dummy()
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
Let’s tune this workflow() with our resampled data sets, find a good value for the regularization penalty, and estimate the model’s performance.
tic("process time") # About 11.5 min
set.seed(123)
more_vars_rs <- tune_grid(
more_vars_wf,
complaints_folds,
grid = smaller_lambda
)
toc()
# tic("process time") # About 1 min
# saveRDS(more_vars_rs,
# "Output/classification_more_vars_rs.rds")
# toc()
tic("process time") # About 8.6 sec
more_vars_rs <- readRDS("Output/classification_more_vars_rs.rds")
toc()
## process time: 10.78 sec elapsed
We can extract the metrics for the best-performing regularization penalties from these results with show_best() with an option like “roc_auc” or “accuracy” if we prefer.
How did our chosen performance metric turn out for our model that included more than just the text data?
more_vars_rs %>%
show_best("roc_auc") %>%
mutate(penalty = round(penalty,5),
mean = round(mean, 4),
std_err = round(std_err,5))
We see here that including more predictors did not measurably improve our model performance or even change the regularization. With only text features in Section 7.5 and the same grid and sparse encoding, we achieved an ROC AUC of 0.953, the same as what we see now by including the features dealing with dates and tags as well.
The best regularization penalty in Section 7.5 was 0.0007 and is about the same here. We can use tidy() and some dplyr manipulation to find at what rank (term_rank) any of the date or tag variables were included in the regularized results, by absolute value of the model coefficient.
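That final fitted model is not shown above, so the following is only a sketch of the idea: finalize more_vars_wf with the best penalty, fit it to the training data, and rank every term by the absolute value of its coefficient (the object names more_vars_final and more_vars_fit are illustrative, not created earlier).
# Sketch (illustrative object names): find the rank of the date and tag
# features among all regularized coefficients.
more_vars_final <- finalize_workflow(
  more_vars_wf,
  select_best(more_vars_rs, metric = "roc_auc")
)
more_vars_fit <- fit(more_vars_final, data = complaints_train)
more_vars_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(desc(abs(estimate))) %>%
  mutate(term_rank = row_number()) %>%
  filter(!str_detect(term, "tfidf"))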