We consider the data set of consumer complaints submitted to the US Consumer Finance Protection Bureau.

# packages
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidymodels)
## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip
## -- Attaching packages -------------------------------------- tidymodels 0.1.4 --
## v broom        0.7.9      v rsample      0.1.0 
## v dials        0.0.10     v tune         0.1.6 
## v infer        1.0.0      v workflows    0.2.4 
## v modeldata    0.1.1      v workflowsets 0.1.0 
## v parsnip      0.1.7      v yardstick    0.0.8 
## v recipes      0.1.17
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(textrecipes)
# install.packages("discrim")
library(discrim)
## 
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
## 
##     smoothness
library(tictoc)
# install.packages("naivebayes")
# install.packages("glmnet")
library(hardhat)
library(themis)
## Registered S3 methods overwritten by 'themis':
##   method                  from   
##   bake.step_downsample    recipes
##   bake.step_upsample      recipes
##   prep.step_downsample    recipes
##   prep.step_upsample      recipes
##   tidy.step_downsample    recipes
##   tidy.step_upsample      recipes
##   tunable.step_downsample recipes
##   tunable.step_upsample   recipes
## 
## Attaching package: 'themis'
## The following objects are masked from 'package:recipes':
## 
##     step_downsample, step_upsample
complaints <- read_csv("Datasets/complaints.csv.gz",
                       show_col_types = FALSE)
glimpse(complaints)
## Rows: 117,214
## Columns: 18
## $ date_received                <date> 2019-09-24, 2019-10-25, 2019-11-08, 2019~
## $ product                      <chr> "Debt collection", "Credit reporting, cre~
## $ sub_product                  <chr> "I do not know", "Credit reporting", "I d~
## $ issue                        <chr> "Attempts to collect debt not owed", "Inc~
## $ sub_issue                    <chr> "Debt is not yours", "Information belongs~
## $ consumer_complaint_narrative <chr> "transworld systems inc. \nis trying to c~
## $ company_public_response      <chr> NA, "Company has responded to the consume~
## $ company                      <chr> "TRANSWORLD SYSTEMS INC", "TRANSUNION INT~
## $ state                        <chr> "FL", "CA", "NC", "RI", "FL", "TX", "SC",~
## $ zip_code                     <chr> "335XX", "937XX", "275XX", "029XX", "333X~
## $ tags                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ consumer_consent_provided    <chr> "Consent provided", "Consent provided", "~
## $ submitted_via                <chr> "Web", "Web", "Web", "Web", "Web", "Web",~
## $ date_sent_to_company         <date> 2019-09-24, 2019-10-25, 2019-11-08, 2019~
## $ company_response_to_consumer <chr> "Closed with explanation", "Closed with e~
## $ timely_response              <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",~
## $ consumer_disputed            <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A",~
## $ complaint_id                 <dbl> 3384392, 3417821, 3433198, 3366475, 33853~
summary(complaints)
##  date_received          product          sub_product           issue          
##  Min.   :2019-01-01   Length:117214      Length:117214      Length:117214     
##  1st Qu.:2019-04-02   Class :character   Class :character   Class :character  
##  Median :2019-06-27   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2019-06-25                                                           
##  3rd Qu.:2019-09-16                                                           
##  Max.   :2020-01-09                                                           
##   sub_issue         consumer_complaint_narrative company_public_response
##  Length:117214      Length:117214                Length:117214          
##  Class :character   Class :character             Class :character       
##  Mode  :character   Mode  :character             Mode  :character       
##                                                                         
##                                                                         
##                                                                         
##    company             state             zip_code             tags          
##  Length:117214      Length:117214      Length:117214      Length:117214     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  consumer_consent_provided submitted_via      date_sent_to_company
##  Length:117214             Length:117214      Min.   :2019-01-01  
##  Class :character          Class :character   1st Qu.:2019-04-03  
##  Mode  :character          Mode  :character   Median :2019-06-27  
##                                               Mean   :2019-06-25  
##                                               3rd Qu.:2019-09-16  
##                                               Max.   :2020-01-09  
##  company_response_to_consumer timely_response    consumer_disputed 
##  Length:117214                Length:117214      Length:117214     
##  Class :character             Class :character   Class :character  
##  Mode  :character             Mode  :character   Mode  :character  
##                                                                    
##                                                                    
##                                                                    
##   complaint_id    
##  Min.   :3113204  
##  1st Qu.:3199571  
##  Median :3288781  
##  Mean   :3288551  
##  3rd Qu.:3375441  
##  Max.   :3491200

7.1 A first classification model

For our first model, let’s build a binary classification model to predict whether a submitted complaint is about

  1. Credit reporting, credit repair services, or other personal consumer reports, or
  2. Other.
unique(complaints$product)
## [1] "Debt collection"                                                             
## [2] "Credit reporting, credit repair services, or other personal consumer reports"
## [3] "Money transfer, virtual currency, or money service"                          
## [4] "Mortgage"                                                                    
## [5] "Student loan"                                                                
## [6] "Credit card or prepaid card"                                                 
## [7] "Checking or savings account"                                                 
## [8] "Payday loan, title loan, or personal loan"                                   
## [9] "Vehicle loan or lease"

This kind of “yes or no” binary classification model is both common and useful in real-world text machine learning problems.

It is always a good idea to look at your data! Here are the first six complaints:

head(complaints$consumer_complaint_narrative) 
## [1] "transworld systems inc. \nis trying to collect a debt that is not mine, not owed and is inaccurate."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [2] "I would like to request the suppression of the following items from my credit report, which are the result of my falling victim to identity theft. This information does not relate to [ transactions that I have made/accounts that I have opened ], as the attached supporting documentation can attest. As such, it should be blocked from appearing on my credit report pursuant to section 605B of the Fair Credit Reporting Act."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [3] "Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [4] "I was sold access to an event digitally, of which I have all the screenshots to detail the transactions, transferred the money and was provided with only a fake of a ticket. I have reported this to paypal and it was for the amount of {$21.00} including a {$1.00} fee from paypal. \n\nThis occured on XX/XX/2019, by paypal user who gave two accounts : 1 ) XXXX 2 ) XXXX XXXX"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [5] "While checking my credit report I noticed three collections by a company called ARS that i was unfamiliar with. I disputed these collections with XXXX, and XXXX and they both replied that they contacted the creditor and the creditor verified the debt so I asked for proof which both bureaus replied that they are not required to prove anything. I then mailed a certified letter to ARS requesting proof of the debts n the form of an original aggrement, or a proof of a right to the debt, or even so much as the process as to how the bill was calculated, to which I was simply replied a letter for each collection claim that listed my name an account number and an amount with no other information to verify the debts after I sent a clear notice to provide me evidence. Afterwards I recontacted both XXXX, and XXXX, to redispute on the premise that it is not my debt if evidence can not be drawn up, I feel as if I am being personally victimized by ARS on my credit report for debts that are not owed to them or any party for that matter, and I feel discouraged that the credit bureaus who control many aspects of my personal finances are so negligent about my information."
## [6] "I would like the credit bureau to correct my XXXX XXXX XXXX XXXX balance. My correct balance is XXXX"

Observations

  • The complaint narratives contain many series of capital “X”’s.

  • These strings (like “XX/XX” or “XXXX XXXX XXXX XXXX”) are used to protect personally identifiable information (PII) in this publicly available data set.

  • This is not a universal censoring mechanism; censoring and PII protection will vary from source to source.

  • Hopefully you will be able to find information on PII censoring in a data dictionary, but you should always look at the data yourself to verify.

  • We also see that monetary amounts are surrounded by curly brackets (like "{$21.00}"); this is another text preprocessing step that has been taken care of for us. 
    • We could craft a regular expression to extract all the dollar amounts.
ccn <- complaints$consumer_complaint_narrative %>%
  str_extract_all(pattern = "\\{\\$[0-9\\.]*\\}") %>%
  compact()  # discards elements that are NULL or that have length zero
head(ccn, 3)
## [[1]]
## [1] "{$21.00}" "{$1.00}" 
## 
## [[2]]
## [1] "{$2300.00}"
## 
## [[3]]
## [1] "{$200.00}"  "{$5000.00}" "{$5000.00}" "{$770.00}"  "{$800.00}" 
## [6] "{$5000.00}"

7.1.1 Building our first classification model

  • This data set includes more possible predictors than the text alone, but for this first model we will only use the text variable consumer_complaint_narrative.

  • Let’s create a factor outcome variable product with two levels,

  • “Credit” and

  • “Other.”

  • Then, we split the data into training and testing data sets.

  • We can use the initial_split() function from rsample to create this binary split of the data.

  • The strata argument ensures that the distribution of product is similar in the training set and testing set.

  • Since the split uses random sampling, we set a seed so we can reproduce our results.

Training and Testing Data Sets

# package 02
# library(tidymodels)
########
set.seed(1234)
complaints2class <- complaints %>%
  mutate(product = factor(
    if_else(
      product == paste("Credit reporting, credit repair services,",
                     "or other personal consumer reports"),
      "Credit", 
      "Other"
      )))
names(complaints2class)
##  [1] "date_received"                "product"                     
##  [3] "sub_product"                  "issue"                       
##  [5] "sub_issue"                    "consumer_complaint_narrative"
##  [7] "company_public_response"      "company"                     
##  [9] "state"                        "zip_code"                    
## [11] "tags"                         "consumer_consent_provided"   
## [13] "submitted_via"                "date_sent_to_company"        
## [15] "company_response_to_consumer" "timely_response"             
## [17] "consumer_disputed"            "complaint_id"
unique(complaints2class$product)
## [1] Other  Credit
## Levels: Credit Other
complaints_split <- initial_split(complaints2class, strata = product)
dim(complaints_split)
##   analysis assessment          n          p 
##      87910      29304     117214         18
complaints_train <- training(complaints_split)
complaints_test <- testing(complaints_split)
scales::percent(nrow(complaints_train)/nrow(complaints_split$data))
## [1] "75%"
scales::percent(nrow(complaints_test)/nrow(complaints_split$data))
## [1] "25%"

Preprocessing the data to prepare it for modeling

## package 03
##  library(textrecipes)
complaints_rec <-
  recipe(product ~ consumer_complaint_narrative, 
         data = complaints_train) %>%
  step_tokenize(consumer_complaint_narrative) %>%
  step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
  step_tfidf(consumer_complaint_narrative)
names(complaints_rec)
## [1] "var_info"  "term_info" "steps"     "template"  "levels"    "retained"
  • Now that we have a full specification of the preprocessing recipe, we can build up a tidymodels workflow() to bundle together our modeling components.
complaint_wf <- workflow() %>%
  add_recipe(complaints_rec)
complaint_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: None
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()

Let’s start with a naive Bayes model (S. Kim et al. 2006; Kibriya et al. 2005; Frank and Bouckaert 2006), which is available in the tidymodels package discrim.

  • One of the main advantages of a naive Bayes model is its ability to handle a large number of features, such as those we deal with when using word count methods.

In tidymodels, the package for creating model specifications is parsnip (Kuhn and Vaughan 2021b). The parsnip package provides the functions for creating all the models we have used so far, and extension packages provide more. The discrim package is one such parsnip extension; it contains model definitions for various discriminant analysis models, including naive Bayes.

Model Specifications

Naive Bayes by StatQuest

## package05
## library(discrim)
nb_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("naivebayes")

nb_spec
## Naive Bayes Model Specification (classification)
## 
## Computational engine: naivebayes

Now we have everything we need to fit our first classification model.

  • We can add the naive Bayes model to our workflow, and
  • then we can fit this workflow to our training data.
tic("process time") ## process time about 1 min
nb_fit <- complaint_wf %>%
  add_model(nb_spec) %>%
  fit(data = complaints_train)
toc()
## tic("process time") # process time about 40 mins
## saveRDS(nb_fit, "Output/classification_nb_fit.rds")
## toc()
tic("process time") # process time 7 secs
nb_fit <- readRDS("Output/classification_nb_fit.rds")
toc()

We have trained our first classification model!

7.1.2 Evaluation

We will use resampling methods to evaluate our model.

  • Let’s use resampling to estimate the performance of the naive Bayes classification model we just fit.

  • We can do this using resampled data sets built from the training set.

  • Let’s create 10-fold cross-validation sets, and use these resampled sets for performance estimates.

set.seed(234)
complaints_folds <- vfold_cv(complaints_train)

glimpse(complaints_folds)
## Rows: 10
## Columns: 2
## $ splits <list> [<vfold_split[79119 x 8791 x 87910 x 18]>], [<vfold_split[7911~
## $ id     <chr> "Fold01", "Fold02", "Fold03", "Fold04", "Fold05", "Fold06", "Fo~
  • Each of these splits contains information about how to create cross-validation folds from the original training data.

  • In this example, 90% of the training data is included in each fold, and

  • the other 10% is held out for evaluation.
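
We can confirm those proportions for the first fold (a quick sketch using rsample’s analysis() and assessment() accessors):

fold1 <- complaints_folds$splits[[1]]
nrow(analysis(fold1))    # 79,119 rows, about 90% of the 87,910 training rows
nrow(assessment(fold1))  # 8,791 rows, the held-out 10%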

For convenience, let’s again use a workflow() for our resampling estimates of performance.

Using a workflow() isn’t required (you can fit or tune a model specification plus a preprocessor separately), but it can make your code easier to read and organize.

nb_wf <- workflow() %>%
  add_recipe(complaints_rec) %>%
  add_model(nb_spec)

nb_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: naive_Bayes()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
## 
## -- Model -----------------------------------------------------------------------
## Naive Bayes Model Specification (classification)
## 
## Computational engine: naivebayes

Now, to estimate how well that model performs, let’s fit the model many times, once to each of these resampled folds, and then evaluate on the heldout part of each resampled fold.

tic("process time")  # process time about 10 mins
nb_rs <- fit_resamples(
  nb_wf,
  complaints_folds,
  control = control_resamples(save_pred = TRUE)
)
toc()
## tic("process time")  ## process time about 1 min
## saveRDS(nb_rs, "Output/classification_nb_rs.rds")
## toc()
tic("process time")  ## process time about 1 min
nb_rs <- readRDS("Output/classification_nb_rs.rds")
toc()
## process time: 11.19 sec elapsed
nb_rs_metrics <- collect_metrics(nb_rs)
nb_rs_predictions <- collect_predictions(nb_rs)
nb_rs_metrics

The default performance metrics for binary classification are accuracy and ROC AUC (area under the receiver operating characteristic curve).

For these resamples, the average accuracy is 80.2%.

Accuracy and ROC AUC are performance metrics used for classification models. For both, values closer to 1 are better.

Accuracy is the proportion of the data that is predicted correctly. Be aware that accuracy can be misleading in some situations, such as for imbalanced data sets.

ROC AUC measures how well a classifier performs at different thresholds. The ROC curve plots the true positive rate against the false positive rate; AUC closer to 1 indicates a better-performing model, while AUC closer to 0.5 indicates a model that does no better than random guessing.
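
As a hedged illustration, both metrics can also be computed directly from the resampled predictions with yardstick; note that this pools the predictions across folds, so the values may differ slightly from the fold-averaged collect_metrics() output:

nb_rs_predictions %>%
  accuracy(truth = product, estimate = .pred_class)
nb_rs_predictions %>%
  roc_auc(truth = product, .pred_Credit)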

nb_rs_predictions %>%
  group_by(id) %>%
  roc_curve(truth = product, .pred_Credit) %>%
  autoplot() +
  labs(
    color = NULL,
    title = "ROC curve for US Consumer Finance Complaints",
    subtitle = "Each resample fold is shown in a different color"
  )

  • Sensitivity and specificity mathematically describe the accuracy of a test which reports the presence or absence of a condition, in comparison to a ‘Gold Standard’ or definition.

  • Sensitivity (True Positive Rate) refers to the proportion of those who received a positive result on this test out of those who actually have the condition (when judged by the ‘Gold Standard’).

  • Specificity (True Negative Rate) refers to the proportion of those who received a negative result on this test out of those who do not actually have the condition (when judged by the ‘Gold Standard’).
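
As a sketch, both can be computed from the resampled predictions with yardstick; “Credit,” the first factor level, is treated as the positive class by default:

nb_rs_predictions %>%
  sensitivity(truth = product, estimate = .pred_class)
nb_rs_predictions %>%
  specificity(truth = product, estimate = .pred_class)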

The area under each of these curves is the roc_auc metric we have computed.

  • If the curve were close to the diagonal line, then the model’s predictions would be no better than random guessing.

Another way to evaluate our model is to examine the confusion matrix.

  • A confusion matrix tabulates a model’s false positives and false negatives for each class.
  • The function conf_mat_resampled() computes a separate confusion matrix for each resample and takes the average of the cell counts.
  • This allows us to visualize an overall confusion matrix rather than needing to examine each resample individually.
conf_mat_resampled(nb_rs, tidy = FALSE) 
##        Credit  Other
## Credit 2864.7  438.2
## Other  1302.1 4186.0
conf_mat_resampled(nb_rs, tidy = FALSE) %>%
  autoplot(type = "heatmap")

In the figure, the squares for “Credit”/“Credit” and “Other”/“Other” have a darker shade than the off-diagonal squares.

  • This is a good sign, meaning that our model is right more often than not!
  • However, this first model is struggling somewhat since many observations from the “Credit” class are being mispredicted as “Other.”

One metric alone cannot give you a complete picture of how well your classification model is performing. The confusion matrix is a good starting point to get an overview of your model performance, as it includes rich information.

7.2 Compare to the null model

We can build a classification null_model() specification, a baseline that always predicts the most frequent class, and add it to a workflow() with the same preprocessing recipe we used in the previous section, to estimate performance.

null_classification <- null_model() %>%
  set_engine("parsnip") %>%
  set_mode("classification")
tic("process time") ## process time is about 9 minutes
null_rs <- workflow() %>%
  add_recipe(complaints_rec) %>%
  add_model(null_classification) %>%
  fit_resamples(
    complaints_folds
  )
toc()
## tic("process time") ## process time is about 1 min
## saveRDS(null_rs, "Output/classification_null_rs.rds")
## toc()
tic("process time") ## process time is about 9 min
null_rs <- readRDS("Output/classification_null_rs.rds")
toc()
## process time: 11.06 sec elapsed
null_rs %>%
  collect_metrics()

7.3 Compare to a lasso classification model

Using regularization helps us choose a simpler model that we expect to generalize better to new observations, and variable selection helps us identify which features to include in our model.

Video: Robust, Interpretable Statistical Models: Sparse Regression with the LASSO

lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>% # lasso : `mixture = 1`
  set_mode("classification") %>%
  set_engine("glmnet")

lasso_spec
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = 0.01
##   mixture = 1
## 
## Computational engine: glmnet

Then we can create another workflow() object with the lasso specification. Notice that we can reuse our text preprocessing recipe.

lasso_wf <- workflow() %>%
  add_recipe(complaints_rec) %>%
  add_model(lasso_spec)

lasso_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
## 
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = 0.01
##   mixture = 1
## 
## Computational engine: glmnet

Now we estimate the performance of this first lasso classification model with fit_resamples().

tic("process_time") # process time is about 22 minutes
set.seed(2020)
lasso_rs <- fit_resamples(
  lasso_wf,
  complaints_folds,
  control = control_resamples(save_pred = TRUE)
)
toc()
# tic("process time") ## process time is about 1 min
# saveRDS(lasso_rs, "Output/classification_lasso_rs.rds")
# toc()
tic("process time") ## process time is about  9 sec
lasso_rs <- readRDS("Output/classification_lasso_rs.rds")
toc()
## process time: 10.69 sec elapsed

Extracting information using collect_metrics() and collect_predictions()

lasso_rs_metrics <- collect_metrics(lasso_rs)
lasso_rs_predictions <- collect_predictions(lasso_rs)
lasso_rs_metrics
  • This looks pretty promising, considering we haven’t yet done any tuning of the lasso hyperparameters.

  • The figure below shows the ROC curves for this regularized model on each of the resampled data sets.

lasso_rs_predictions %>%
  group_by(id) %>%
  roc_curve(truth = product, .pred_Credit) %>%
  autoplot() +
  labs(
    color = NULL,
    title = "ROC curve for US Consumer Finance Complaints",
    subtitle = "Each resample fold is shown in a different color"
  )

Generating confusion matrix

  • Our lasso model is better at separating the classes than the naive Bayes model from the previous section, and its results are more symmetrical.
conf_mat_resampled(lasso_rs, tidy = FALSE) 
##        Credit  Other
## Credit 3421.6  401.8
## Other   745.2 4222.4
conf_mat_resampled(lasso_rs, tidy = FALSE) %>%
  autoplot(type = "heatmap")

7.4 Tuning lasso hyperparameters

  • The value penalty = 0.01 for regularization in the previous section was picked somewhat arbitrarily.

  • How do we know the right or best regularization parameter penalty?

  • This is a model hyperparameter, and we cannot learn its best value during model training, but we can estimate the best value by training many models on resampled data sets and exploring how well all these models perform.

  • Let’s build a new model specification for model tuning.

tune_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

tune_spec
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet
  • After the tuning process, we can select a single best numeric value.

Think of tune() here as a placeholder for the regularization penalty.

  • We can create a regular grid of values to try, using a convenience function for penalty().
lambda_grid <- grid_regular(penalty(), levels = 30)
lambda_grid
  • The function grid_regular() is from the dials package.

  • It chooses sensible values to try for a parameter like the regularization penalty; here, we asked for 30 different possible values.

  • Let’s use tune_grid() to fit a model at each of the values for the regularization penalty in our regular grid.

In tidymodels, the package for tuning is called tune. Tuning a model uses syntax similar to fitting a model to a set of resampled data sets for evaluation (fit_resamples()) because the two tasks are so similar. The difference is that when you tune, each model that you fit has different parameters, and you want to find the best one.

Resampling with Tuning

  • Let’s add our tunable model specification tune_spec to a workflow with the same preprocessing recipe we’ve been using so far, and then fit it to every possible parameter in lambda_grid and every resample in complaints_folds with tune_grid().
tune_wf <- workflow() %>%
  add_recipe(complaints_rec) %>%
  add_model(tune_spec)
tic("process time") # about 22 min
set.seed(2020)
tune_rs <- tune_grid(
  tune_wf,
  complaints_folds,
  grid = lambda_grid,
  control = control_resamples(save_pred = TRUE)
)
toc()
# tic("process time") # about 1 min
# saveRDS(tune_rs, "Output/classification_tune_rs.rds")
# toc()
tic("process time") # about  10 secs
tune_rs <- readRDS("Output/classification_tune_rs.rds")
toc()
## process time: 11.4 sec elapsed
glimpse(tune_rs)
## Rows: 10
## Columns: 5
## $ splits       <list> [<vfold_split[79119 x 8791 x 87910 x 18]>], [<vfold_spli~
## $ id           <chr> "Fold01", "Fold02", "Fold03", "Fold04", "Fold05", "Fold06~
## $ .metrics     <list> [<tbl_df[60 x 5]>], [<tbl_df[60 x 5]>], [<tbl_df[60 x 5]~
## $ .notes       <list> [<tbl_df[0 x 1]>], [<tbl_df[0 x 1]>], [<tbl_df[0 x 1]>],~
## $ .predictions <list> [<tbl_df[263730 x 7]>], [<tbl_df[263730 x 7]>], [<tbl_df~
  • Now, instead of one set of metrics, we have a set of metrics for each value of the regularization penalty.
collect_metrics(tune_rs)

Visualizing the metrics

autoplot(tune_rs) +
  labs(
    title = "Lasso model performance across regularization penalties",
    subtitle = "Performance metrics can be used to identity the best penalty"
  )

  • We can view the best results with show_best() and a choice for the metric, such as ROC AUC.
tune_rs %>%
  show_best("roc_auc") %>%
  mutate(mean = round(mean, 4))
  • The best value for ROC AUC from this tuning run is 0.953.

  • We can extract the best regularization parameter for this value of ROC AUC from our tuning results with select_best(), or choose a simpler model with higher regularization via select_by_pct_loss() or select_by_one_std_err().

  • Let’s choose the model with the best ROC AUC within one standard error of the numerically best model (Breiman et al. 1984).

chosen_auc <- tune_rs %>%
  select_by_one_std_err(metric = "roc_auc", -penalty) %>%
  mutate(mean = round(mean, 4))

chosen_auc
tune_rs %>%
  select_best(metric = "roc_auc")
tune_rs %>%
  select_best(metric = "accuracy")
  • Next, let’s finalize our tunable workflow with this chosen regularization penalty.

This is the regularization penalty that our tuning results indicate gives us the best model.

final_lasso <- finalize_workflow(tune_wf, chosen_auc)
final_lasso
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
## 
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = 0.000788046281566992
##   mixture = 1
## 
## Computational engine: glmnet
  • Instead of penalty = tune() like before, now our workflow has finalized values for all arguments.

  • The preprocessing recipe has been evaluated on the training data, and we tuned the regularization penalty so that we have a penalty value of 0.00079.

  • This workflow is ready to go! It can now be fit to our training data.

tic("process time")  # process time is about 2 min
fitted_lasso <- fit(final_lasso, complaints_train)
toc()
# tic("process time") # About 34 secs
# saveRDS(fitted_lasso, "Output/classification_fitted_lasso.rds")
# toc()
tic("process time") # about 5 sec
fitted_lasso <- readRDS("Output/classification_fitted_lasso.rds")
toc()
## process time: 7.44 sec elapsed
fitted_lasso.tidy <- 
  fitted_lasso %>%
  extract_fit_parsnip() %>%  # pull_workflow_fit() is deprecated as of workflows 0.2.3
  tidy() %>%
  mutate(estimate = round(estimate, 1)) %>%
  select(-penalty) 
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-2
#####
head(fitted_lasso.tidy)

What does the result look like?

  • For the penalty of 0.0007880463, these are the terms that contribute the most to a complaint not being about credit.

  • The words are largely about mortgages and other financial products.

# not about credit
fitted_lasso.tidy %>%
  filter(estimate > 10) %>%
  arrange(-estimate) %>%
  separate(term, c(NA,NA,NA,NA, "term")) %>%
  knitr::kable()
term estimate
funds 27.6
appraisal 22.9
escrow 21.0
bonus 20.7
debt 18.5
emailed 16.4
money 16.1
interest 15.7
afford 15.5
fees 14.9
merchant 14.9
branch 14.6
cash 14.5
collect 14.4
servicer 13.8
fee 13.4
transfer 13.0
refusing 12.6
purchases 12.3
affidavit 12.2
fdcpa 11.9
refund 11.8
thought 11.3
paypal 11.2
processing 11.2
bank 10.9
calling 10.8
title 10.8
purchased 10.7
closing 10.5
at 10.3
debit 10.3
customers 10.2
deposited 10.2

What terms contribute to a complaint being about credit reporting, for this tuned model?

  • Here we see the names of the credit reporting agencies and words about credit inquiries.
fitted_lasso.tidy %>%
  filter(estimate < -10) %>%
  arrange(estimate) %>%
  separate(term, c(NA,NA,NA,NA, "term")) %>%
  knitr::kable()
term estimate
reseller -90.9
experian -56.9
transunion -50.1
equifax -48.1
compliant -23.7
reporting -21.1
freeze -20.9
inquiries -19.0
report -18.6
method -16.3
inquiry -15.0
physically -15.0
delinquency -14.5
investigation -14.1
late -14.0
accuracy -13.9
delinquent -13.9
score -13.2
scores -12.5
credit -12.2
run -12.2
investigate -12.1
accounts -12.0
items -11.9
companies -11.4
dropped -11.3
correct -11.1
addresses -11.0
corrected -10.8
creditors -10.5
compliance -10.4
record -10.3
verifiable -10.1

Since we are using a linear model, the model coefficients are directly interpretable and transparently give us variable importance. Many models useful for machine learning with text do not have such transparent variable importance; in those situations, you can use other model-independent or model-agnostic approaches like permutation variable importance.

7.5 Case study: sparse encoding

To keep our text data sparse throughout modeling and use the sparse capabilities of set_engine("glmnet"), we need to explicitly set a non-default preprocessing blueprint, using the package hardhat (Vaughan and Kuhn 2020).

The hardhat package is used by other tidymodels packages like recipes and parsnip under the hood. As a tidymodels user, you typically don’t use hardhat functions directly. The exception is when you need to customize something about your model or preprocessing, like in this sparse data example.

# package06
# library(hardhat)
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")

This “blueprint” lets us specify during modeling how we want our data passed around from the preprocessing into the model.

sparse_wf <- workflow() %>%
  add_recipe(complaints_rec, blueprint = sparse_bp) %>%
  add_model(tune_spec)

sparse_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
## 
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet
smaller_lambda <- grid_regular(penalty(range = c(-5, 0)), levels = 20)
smaller_lambda %>%
  mutate(penalty = format(penalty, scientific=FALSE)) %>%
  knitr::kable()
penalty
0.00001000000
0.00001832981
0.00003359818
0.00006158482
0.00011288379
0.00020691381
0.00037926902
0.00069519280
0.00127427499
0.00233572147
0.00428133240
0.00784759970
0.01438449888
0.02636650899
0.04832930239
0.08858667904
0.16237767392
0.29763514416
0.54555947812
1.00000000000
tic("process time") # about 12 minutes
set.seed(2020)
sparse_rs <- tune_grid(
  sparse_wf,
  complaints_folds,
  grid = smaller_lambda
)
toc()
# tic("process time") # process time about 1 min and 15 secs
# saveRDS(sparse_rs, "Output/classification_sparse_rs.rds")
# toc()
tic("process time") # process time about 11 secs
sparse_rs <- readRDS("Output/classification_sparse_rs.rds")
toc()
## process time: 10.86 sec elapsed
glimpse(sparse_rs)
## Rows: 10
## Columns: 4
## $ splits   <list> [<vfold_split[79119 x 8791 x 87910 x 18]>], [<vfold_split[79~
## $ id       <chr> "Fold01", "Fold02", "Fold03", "Fold04", "Fold05", "Fold06", "~
## $ .metrics <list> [<tbl_df[40 x 5]>], [<tbl_df[40 x 5]>], [<tbl_df[40 x 5]>], ~
## $ .notes   <list> [<tbl_df[0 x 1]>], [<tbl_df[0 x 1]>], [<tbl_df[0 x 1]>], [<t~
sparse_rs %>%
  show_best("roc_auc") %>%
  mutate(penalty = round(penalty, 6),
         mean = round(mean, digits=4),
         std_err = round(std_err, digits = 6))

Since our model performance is about the same and we see gains in training time, let’s use this sparse representation for the rest of this chapter.

7.6 Two-class or multiclass?

So far we have collapsed product into a binary outcome with just two classes; now let’s keep all nine categories and treat this as a multiclass problem.

set.seed(1234)
multicomplaints_split <- initial_split(complaints, strata = product)
dim(multicomplaints_split)
##   analysis assessment          n          p 
##      87910      29304     117214         18
multicomplaints_train <- training(multicomplaints_split)
multicomplaints_test <- testing(multicomplaints_split)
multicomplaints_train %>%
  count(product, sort = TRUE) %>%
  select(n, product) %>%
  knitr::kable()
n product
41678 Credit reporting, credit repair services, or other personal consumer reports
16688 Debt collection
8689 Credit card or prepaid card
7117 Mortgage
5219 Checking or savings account
2927 Student loan
2030 Vehicle loan or lease
1906 Money transfer, virtual currency, or money service
1656 Payday loan, title loan, or personal loan

These classes are clearly imbalanced. The themis package provides many algorithms to deal with imbalanced data during data preprocessing; here we downsample the majority classes with step_downsample().
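
As a toy illustration (hypothetical data, a sketch only), step_downsample() removes rows from the majority class until the classes are balanced:

set.seed(123)
toy <- tibble(
  class = factor(c(rep("A", 8), rep("B", 2))),  # 8 majority, 2 minority
  x = rnorm(10)
)
recipe(class ~ x, data = toy) %>%
  step_downsample(class) %>%
  prep() %>%
  bake(new_data = NULL) %>%  # downsampling is applied to the training data only
  count(class)               # both classes now have 2 rows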

# package07
# library(themis)
multicomplaints_rec <-
  recipe(product ~ consumer_complaint_narrative,
         data = multicomplaints_train) %>%
  step_tokenize(consumer_complaint_narrative) %>%
  step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
  step_tfidf(consumer_complaint_narrative) %>%
  step_downsample(product)

We also need a new cross-validation object since we are using a different data set.

multicomplaints_folds <- vfold_cv(multicomplaints_train)
multi_spec <- multinom_reg(penalty = tune(), mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

multi_spec
## Multinomial Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet
multi_lasso_wf <- workflow() %>%
  add_recipe(multicomplaints_rec, blueprint = sparse_bp) %>%
  add_model(multi_spec)

multi_lasso_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: multinom_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
## 
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
## * step_downsample()
## 
## -- Model -----------------------------------------------------------------------
## Multinomial Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet
tic("process time") # About 1 hr
multi_lasso_rs <- tune_grid(
  multi_lasso_wf,
  multicomplaints_folds,
  grid = smaller_lambda,
  control = control_resamples(save_pred = TRUE)
)
toc()
# tic("process time") # About 1 min and 7 secs
# saveRDS(multi_lasso_rs, "Output/classification_multi_lasso_rs.rds")
# toc()
 tic("process time") # About 10 secs
 multi_lasso_rs <- readRDS("Output/classification_multi_lasso_rs.rds")
 toc()
## process time: 13.02 sec elapsed
glimpse(multi_lasso_rs)
## Rows: 10
## Columns: 5
## $ splits       <list> [<vfold_split[79119 x 8791 x 87910 x 18]>], [<vfold_spli~
## $ id           <chr> "Fold01", "Fold02", "Fold03", "Fold04", "Fold05", "Fold06~
## $ .metrics     <list> [<tbl_df[40 x 5]>], [<tbl_df[40 x 5]>], [<tbl_df[40 x 5]~
## $ .notes       <list> [<tbl_df[1 x 1]>], [<tbl_df[0 x 1]>], [<tbl_df[1 x 1]>],~
## $ .predictions <list> [<tbl_df[175820 x 14]>], [<tbl_df[175820 x 14]>], [<tbl_~
best_acc <- multi_lasso_rs %>%
  show_best("accuracy")

best_acc %>%
  mutate(penalty = round(penalty,5),
         mean = round(mean, 3),
         std_err = round(std_err, 5))

In binary classification, there is one right answer and one wrong answer; in this multiclass case, there is one right answer and eight wrong answers.

tic("process time") About 20 seconds
conf_mat_ml_rs <-
  multi_lasso_rs %>%
  collect_predictions() %>%
  filter(penalty == best_acc$penalty[1]) %>% # use the single best penalty; show_best() returns several rows
  filter(id == "Fold01") %>%
  conf_mat(product, .pred_class)
toc()
#  saveRDS(conf_mat_ml_rs, "Output/classification_conf_mat_ml_rs.rds")
conf_mat_ml_rs <- readRDS("Output/classification_conf_mat_ml_rs.rds")
png(filename = "Images/classification_conf_mat_ml_rs.png",
    height = 800, width = 1500, res = 144)
conf_mat_ml_rs %>%
  autoplot(type = "heatmap") +
  scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
  scale_x_discrete(labels = function(x) str_wrap(x, 20))
dev.off()
## png 
##   2

tic("process time")
conf_mat_ml_rs2 <-
  multi_lasso_rs %>%
  collect_predictions() %>%
  filter(penalty == best_acc$penalty[1]) %>%
  filter(id == "Fold01") %>%
  filter(.pred_class != product) %>%
  conf_mat(product, .pred_class) 
toc()
# saveRDS(conf_mat_ml_rs2, "Output/classification_conf_mat_ml_rs2.rds")
conf_mat_ml_rs2 <- readRDS("Output/classification_conf_mat_ml_rs2.rds")
png("Images/classification_conf_mat_ml_rs2.png",
    height = 800, width = 1500, res = 144)
conf_mat_ml_rs2 %>%
  autoplot(type = "heatmap") +
  scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
  scale_x_discrete(labels = function(x) str_wrap(x, 20))
dev.off()
## png 
##   2

7.7 Case study: including non-text data

more_vars_rec <-
  recipe(product ~ date_received + tags + consumer_complaint_narrative,
         data = complaints_train)

How should we preprocess the date_received variable?

  • We can use the step_date() function to extract the month and day of the week (“dow”).

  • Then we remove the original date variable and convert the new month and day-of-the-week columns to indicator variables with step_dummy().

Categorical variables like the month can be stored as strings or factors, but for some kinds of models, they must be converted to indicator or dummy variables. These are numeric binary variables for the levels of the original categorical variable. For example, a variable called December would be created that is all zeroes and ones specifying which complaints were submitted in December, plus a variable called November, a variable called October, and so on.
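
As a toy illustration (hypothetical data, a sketch), here is how step_dummy() expands a month factor into indicator columns:

toy_dates <- tibble(month = factor(c("October", "November", "December")))
recipe(~ month, data = toy_dates) %>%
  step_dummy(month) %>%
  prep() %>%
  bake(new_data = NULL)  # one indicator column per level except the reference level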

glimpse(complaints_train)
## Rows: 87,910
## Columns: 18
## $ date_received                <date> 2019-11-20, 2019-04-17, 2019-10-24, 2019~
## $ product                      <fct> Credit, Credit, Credit, Credit, Credit, C~
## $ sub_product                  <chr> "Credit reporting", "Credit reporting", "~
## $ issue                        <chr> "Incorrect information on your report", "~
## $ sub_issue                    <chr> "Account information incorrect", "Their i~
## $ consumer_complaint_narrative <chr> "I would like the credit bureau to correc~
## $ company_public_response      <chr> "Company has responded to the consumer an~
## $ company                      <chr> "TRANSUNION INTERMEDIATE HOLDINGS, INC.",~
## $ state                        <chr> "TX", "NC", "NC", "CO", "WI", "CA", "FL",~
## $ zip_code                     <chr> "77004", "285XX", "276XX", "80920", NA, N~
## $ tags                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ consumer_consent_provided    <chr> "Consent provided", "Consent provided", "~
## $ submitted_via                <chr> "Web", "Web", "Web", "Web", "Web", "Web",~
## $ date_sent_to_company         <date> 2019-11-20, 2019-04-17, 2019-10-24, 2019~
## $ company_response_to_consumer <chr> "Closed with explanation", "Closed with e~
## $ timely_response              <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",~
## $ consumer_disputed            <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A",~
## $ complaint_id                 <dbl> 3444592, 3214857, 3417374, 3420108, 33801~
more_vars_rec <- more_vars_rec %>%
  step_date(date_received, features = c("month", "dow"), role = "dates") %>%
  step_rm(date_received) %>%
  step_dummy(has_role("dates"))
  • The tags variable has some missing data.

    • We can deal with this by using step_unknown(), which adds a new level to this factor variable for cases of missing data.

    • Then we “dummify” (create dummy/indicator variables) the variable with step_dummy().

sum(is.na(complaints_train$tags))
## [1] 73192
more_vars_rec <- more_vars_rec %>%
  step_unknown(tags) %>%
  step_dummy(tags)
  • Now we add steps to process the text of the complaints, as before.
more_vars_rec <- more_vars_rec %>%
  step_tokenize(consumer_complaint_narrative) %>%
  step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
  step_tfidf(consumer_complaint_narrative)
  • Let’s combine this more extensive preprocessing recipe that handles more variables with the tunable lasso regularized classification model specification.
more_vars_wf <- workflow() %>%
  add_recipe(more_vars_rec, blueprint = sparse_bp) %>%
  add_model(tune_spec)

more_vars_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 8 Recipe Steps
## 
## * step_date()
## * step_rm()
## * step_dummy()
## * step_unknown()
## * step_dummy()
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
## 
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

Let’s tune this workflow() with our resampled data sets, find a good value for the regularization penalty, and estimate the model’s performance.

tic("process time") # About 11.5 min
set.seed(123)
more_vars_rs <- tune_grid(
  more_vars_wf,
  complaints_folds,
  grid = smaller_lambda,
)
toc()
# tic("process time") # About  1 min
# saveRDS(more_vars_rs,
#         "Output/classification_more_vars_rs.rds")
# toc()
tic("process time") # About  8.6 sec
more_vars_rs <- readRDS("Output/classification_more_vars_rs.rds")
toc()
## process time: 10.78 sec elapsed
  • We can extract the metrics for the best-performing regularization penalties from these results with show_best(), specifying a metric such as “roc_auc” or “accuracy”.

  • How did our chosen performance metric turn out for our model that included more than just the text data?

more_vars_rs %>%
  show_best("roc_auc") %>%
  mutate(penalty = round(penalty,5),
         mean = round(mean, 4),
         std_err = round(std_err,5)) 
  • We see here that including more predictors did not measurably improve our model performance, or even change the chosen regularization. With only text features in Section 7.5 and the same grid and sparse encoding, we achieved a ROC AUC of 0.953, the same as what we see now when including the date and tag features as well.

  • The best regularization penalty in Section 7.5 was 0.0007 and is about the same here. We can use tidy() and some dplyr manipulation to find at what rank (term_rank) any of the date or tag variables were included in the regularized results, by absolute value of the model coefficient, as sketched below.
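
Here is a sketch of that manipulation, assuming the finalized workflow had been fit to the training data as fitted_more_vars (a hypothetical object analogous to fitted_lasso above):

fitted_more_vars %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(-abs(estimate)) %>%
  mutate(term_rank = row_number()) %>%  # rank terms by absolute coefficient size
  filter(!str_detect(term, "tfidf"))    # keep only the date and tag variables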