We consider the data set of consumer complaints submitted to the United States Consumer Financial Protection Bureau (CFPB).
# packages
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.5 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## -- Attaching packages -------------------------------------- tidymodels 0.1.4 --
## v broom 0.7.9 v rsample 0.1.0
## v dials 0.0.10 v tune 0.1.6
## v infer 1.0.0 v workflows 0.2.4
## v modeldata 0.1.1 v workflowsets 0.1.0
## v parsnip 0.1.7 v yardstick 0.0.8
## v recipes 0.1.17
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(textrecipes)
# install.packages("discrim")
library(discrim)
##
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
##
## smoothness
library(tictoc)
# install.packages("naivebayes")
# install.packages("glmnet")
library(hardhat)
library(themis)
## Registered S3 methods overwritten by 'themis':
## method from
## bake.step_downsample recipes
## bake.step_upsample recipes
## prep.step_downsample recipes
## prep.step_upsample recipes
## tidy.step_downsample recipes
## tidy.step_upsample recipes
## tunable.step_downsample recipes
## tunable.step_upsample recipes
##
## Attaching package: 'themis'
## The following objects are masked from 'package:recipes':
##
## step_downsample, step_upsample
complaints <- read_csv("Datasets/complaints.csv.gz",
show_col_types = FALSE)
glimpse(complaints)
## Rows: 117,214
## Columns: 18
## $ date_received <date> 2019-09-24, 2019-10-25, 2019-11-08, 2019~
## $ product <chr> "Debt collection", "Credit reporting, cre~
## $ sub_product <chr> "I do not know", "Credit reporting", "I d~
## $ issue <chr> "Attempts to collect debt not owed", "Inc~
## $ sub_issue <chr> "Debt is not yours", "Information belongs~
## $ consumer_complaint_narrative <chr> "transworld systems inc. \nis trying to c~
## $ company_public_response <chr> NA, "Company has responded to the consume~
## $ company <chr> "TRANSWORLD SYSTEMS INC", "TRANSUNION INT~
## $ state <chr> "FL", "CA", "NC", "RI", "FL", "TX", "SC",~
## $ zip_code <chr> "335XX", "937XX", "275XX", "029XX", "333X~
## $ tags <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ consumer_consent_provided <chr> "Consent provided", "Consent provided", "~
## $ submitted_via <chr> "Web", "Web", "Web", "Web", "Web", "Web",~
## $ date_sent_to_company <date> 2019-09-24, 2019-10-25, 2019-11-08, 2019~
## $ company_response_to_consumer <chr> "Closed with explanation", "Closed with e~
## $ timely_response <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",~
## $ consumer_disputed <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A",~
## $ complaint_id <dbl> 3384392, 3417821, 3433198, 3366475, 33853~
summary(complaints)
## date_received product sub_product issue
## Min. :2019-01-01 Length:117214 Length:117214 Length:117214
## 1st Qu.:2019-04-02 Class :character Class :character Class :character
## Median :2019-06-27 Mode :character Mode :character Mode :character
## Mean :2019-06-25
## 3rd Qu.:2019-09-16
## Max. :2020-01-09
## sub_issue consumer_complaint_narrative company_public_response
## Length:117214 Length:117214 Length:117214
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## company state zip_code tags
## Length:117214 Length:117214 Length:117214 Length:117214
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## consumer_consent_provided submitted_via date_sent_to_company
## Length:117214 Length:117214 Min. :2019-01-01
## Class :character Class :character 1st Qu.:2019-04-03
## Mode :character Mode :character Median :2019-06-27
## Mean :2019-06-25
## 3rd Qu.:2019-09-16
## Max. :2020-01-09
## company_response_to_consumer timely_response consumer_disputed
## Length:117214 Length:117214 Length:117214
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## complaint_id
## Min. :3113204
## 1st Qu.:3199571
## Median :3288781
## Mean :3288551
## 3rd Qu.:3375441
## Max. :3491200
For our first model, let’s build a binary classification model to predict whether a submitted complaint is about "Credit reporting, credit repair services, or other personal consumer reports" or about something else. First, let’s look at the possible values of product:
unique(complaints$product)
## [1] "Debt collection"
## [2] "Credit reporting, credit repair services, or other personal consumer reports"
## [3] "Money transfer, virtual currency, or money service"
## [4] "Mortgage"
## [5] "Student loan"
## [6] "Credit card or prepaid card"
## [7] "Checking or savings account"
## [8] "Payday loan, title loan, or personal loan"
## [9] "Vehicle loan or lease"
This kind of “yes or no” binary classification model is both common and useful in real-world text machine learning problems.
It is always a good idea to look at your data! Here are the first six complaints:
head(complaints$consumer_complaint_narrative)
## [1] "transworld systems inc. \nis trying to collect a debt that is not mine, not owed and is inaccurate."
## [2] "I would like to request the suppression of the following items from my credit report, which are the result of my falling victim to identity theft. This information does not relate to [ transactions that I have made/accounts that I have opened ], as the attached supporting documentation can attest. As such, it should be blocked from appearing on my credit report pursuant to section 605B of the Fair Credit Reporting Act."
## [3] "Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work."
## [4] "I was sold access to an event digitally, of which I have all the screenshots to detail the transactions, transferred the money and was provided with only a fake of a ticket. I have reported this to paypal and it was for the amount of {$21.00} including a {$1.00} fee from paypal. \n\nThis occured on XX/XX/2019, by paypal user who gave two accounts : 1 ) XXXX 2 ) XXXX XXXX"
## [5] "While checking my credit report I noticed three collections by a company called ARS that i was unfamiliar with. I disputed these collections with XXXX, and XXXX and they both replied that they contacted the creditor and the creditor verified the debt so I asked for proof which both bureaus replied that they are not required to prove anything. I then mailed a certified letter to ARS requesting proof of the debts n the form of an original aggrement, or a proof of a right to the debt, or even so much as the process as to how the bill was calculated, to which I was simply replied a letter for each collection claim that listed my name an account number and an amount with no other information to verify the debts after I sent a clear notice to provide me evidence. Afterwards I recontacted both XXXX, and XXXX, to redispute on the premise that it is not my debt if evidence can not be drawn up, I feel as if I am being personally victimized by ARS on my credit report for debts that are not owed to them or any party for that matter, and I feel discouraged that the credit bureaus who control many aspects of my personal finances are so negligent about my information."
## [6] "I would like the credit bureau to correct my XXXX XXXX XXXX XXXX balance. My correct balance is XXXX"
The complaint narratives contain many series of capital “X”’s.
These strings (like “XX/XX” or “XXXX XXXX XXXX XXXX”) are used to protect personally identifiable information (PII) in this publicly available data set.
This is not a universal censoring mechanism; censoring and PII protection will vary from source to source.
Hopefully you will be able to find information on PII censoring in a data dictionary, but you should always look at the data yourself to verify.
We also see that monetary amounts are surrounded by curly brackets (like "{$21.00}"); this is another text preprocessing step that has been taken care of for us.
ccn <- complaints$consumer_complaint_narrative %>%
str_extract_all(pattern = "\\{\\$[0-9\\.]*\\}") %>%
compact() # discards elements that are NULL or that have length zero
head(ccn,3)
## [[1]]
## [1] "{$21.00}" "{$1.00}"
##
## [[2]]
## [1] "{$2300.00}"
##
## [[3]]
## [1] "{$200.00}" "{$5000.00}" "{$5000.00}" "{$770.00}" "{$800.00}"
## [6] "{$5000.00}"
This data set includes more possible predictors than the text alone, but for this first model we will only use the text variable consumer_complaint_narrative.
Let’s create a factor outcome variable product with two levels,
“Credit” and
“Other.”
Then, we split the data into training and testing data sets.
We can use the initial_split() function from rsample to create this binary split of the data.
The strata argument ensures that the distribution of product is similar in the training set and testing set.
Since the split uses random sampling, we set a seed so we can reproduce our results.
Training and Testing Data Sets
# package 02
# library(tidymodels)
########
set.seed(1234)
complaints2class <- complaints %>%
mutate(product = factor(
if_else(
product == paste("Credit reporting, credit repair services,",
"or other personal consumer reports"),
"Credit",
"Other"
)))
names(complaints2class)
## [1] "date_received" "product"
## [3] "sub_product" "issue"
## [5] "sub_issue" "consumer_complaint_narrative"
## [7] "company_public_response" "company"
## [9] "state" "zip_code"
## [11] "tags" "consumer_consent_provided"
## [13] "submitted_via" "date_sent_to_company"
## [15] "company_response_to_consumer" "timely_response"
## [17] "consumer_disputed" "complaint_id"
unique(complaints2class$product)
## [1] Other Credit
## Levels: Credit Other
complaints_split <- initial_split(complaints2class, strata = product)
dim(complaints_split)
## analysis assessment n p
## 87910 29304 117214 18
complaints_train <- training(complaints_split)
complaints_test <- testing(complaints_split)
scales::percent(nrow(complaints_train)/nrow(complaints_split$data))
## [1] "75%"
scales::percent(nrow(complaints_test)/nrow(complaints_split$data))
## [1] "25%"
## package 03
## library(textrecipes)
complaints_rec <-
recipe(product ~ consumer_complaint_narrative,
data = complaints_train) %>%
step_tokenize(consumer_complaint_narrative) %>%
step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
step_tfidf(consumer_complaint_narrative)
names(complaints_rec)
## [1] "var_info" "term_info" "steps" "template" "levels" "retained"
Let’s create a workflow() to bundle together our modeling components.
complaint_wf <- workflow() %>%
add_recipe(complaints_rec)
complaint_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: None
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
Let’s start with a naive Bayes model (S. Kim et al. 2006; Kibriya et al. 2005; Frank and Bouckaert 2006), which is available in the tidymodels package discrim.
In tidymodels, the package for creating model specifications is parsnip (Kuhn and Vaughan 2021b). The parsnip package provides the functions for creating all the models we have used so far, but other extension packages provide more. The discrim package is an extension package for parsnip that contains model definitions for various discriminant analysis models, including naive Bayes.
Model Specifications
## package05
## library(discrim)
nb_spec <- naive_Bayes() %>%
set_mode("classification") %>%
set_engine("naivebayes")
nb_spec
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes
Now we have everything we need to fit our first classification model.
Let’s add the naive Bayes model to our workflow and fit it to the training data.
tic("process time") ## process time about 1 min
nb_fit <- complaint_wf %>%
add_model(nb_spec) %>%
fit(data = complaints_train)
toc()
## tic("process time") # process time about 40 mins
## saveRDS(nb_fit, "Output/regression_nb_fit.rds")
## toc()
tic("process time") # process time 7 secs
nb_fit <- readRDS("Output/classification_nb_fit.rds")
toc()
We have trained our first classification model!
We will use resampling methods to evaluate our model.
Let’s use resampling to estimate the performance of the naive Bayes classification model we just fit.
We can do this using resampled data sets built from the training set.
Let’s create 10-fold cross-validation sets, and use these resampled sets for performance estimates.
set.seed(234)
complaints_folds <- vfold_cv(complaints_train)
glimpse(complaints_folds)
## Rows: 10
## Columns: 2
## $ splits <list> [<vfold_split[79119 x 8791 x 87910 x 18]>], [<vfold_split[7911~
## $ id <chr> "Fold01", "Fold02", "Fold03", "Fold04", "Fold05", "Fold06", "Fo~
Each of these splits contains information about how to create cross-validation folds from the original training data.
In this example, 90% of the training data is included in each fold, and
the other 10% is held out for evaluation.
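We can peek at a single fold to confirm these sizes; analysis() and assessment() from rsample return the two parts of a split.
# Check the analysis/assessment sizes for the first resample.
first_fold <- complaints_folds$splits[[1]]
dim(analysis(first_fold))
dim(assessment(first_fold))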
For convenience, let’s again use a workflow() for our resampling estimates of performance.
Using a workflow() isn’t required (you can fit or tune a model together with a preprocessor directly), but it can make your code easier to read and organize.
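For reference, the equivalent call without a workflow() passes the model specification and recipe directly to fit_resamples(); the sketch below is commented out since it would refit the model.
# Equivalent without a workflow() (not run):
# fit_resamples(nb_spec, preprocessor = complaints_rec,
#               resamples = complaints_folds)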
nb_wf <- workflow() %>%
add_recipe(complaints_rec) %>%
add_model(nb_spec)
nb_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: naive_Bayes()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes
Now, to estimate how well that model performs, let’s fit the model many times, once to each of these resampled folds, and then evaluate on the heldout part of each resampled fold.
tic("process time") # process time about 10 mins
nb_rs <- fit_resamples(
nb_wf,
complaints_folds,
control = control_resamples(save_pred = TRUE)
)
toc()
## tic("process time") ## process time about 1 min
## saveRDS(nb_rs, "Output/regression_nb_rs.rds")
## toc()
tic("process time") ## process time about 1 min
nb_rs <- readRDS("Output/classification_nb_rs.rds")
toc()
## process time: 11.19 sec elapsed
nb_rs_metrics <- collect_metrics(nb_rs)
nb_rs_predictions <- collect_predictions(nb_rs)
nb_rs_metrics
The default performance metrics for binary classification are accuracy and ROC AUC (area under the receiver operating characteristic curve).
For these resamples, the average accuracy is 80.2%.
Accuracy and ROC AUC are performance metrics used for classification models. For both, values closer to 1 are better.
Accuracy is the proportion of the data that is predicted correctly. Be aware that accuracy can be misleading in some situations, such as for imbalanced data sets.
ROC AUC measures how well a classifier performs at different thresholds. The ROC curve plots the true positive rate against the false positive rate; AUC closer to 1 indicates a better-performing model, while AUC closer to 0.5 indicates a model that does no better than random guessing.
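As a quick check, the same per-fold metrics can be computed directly from the collected predictions with yardstick; these are the values that collect_metrics() averages.
# Per-fold accuracy and ROC AUC from the resampled predictions.
nb_rs_predictions %>%
  group_by(id) %>%
  accuracy(truth = product, estimate = .pred_class)
nb_rs_predictions %>%
  group_by(id) %>%
  roc_auc(truth = product, .pred_Credit)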
nb_rs_predictions %>%
group_by(id) %>%
roc_curve(truth = product, .pred_Credit) %>%
autoplot() +
labs(
color = NULL,
title = "ROC curve for US Consumer Finance Complaints",
subtitle = "Each resample fold is shown in a different color"
)
Sensitivity and specificity mathematically describe the accuracy of a test which reports the presence or absence of a condition, in comparison to a ‘Gold Standard’ or definition.
Sensitivity (True Positive Rate) refers to the proportion of those who received a positive result on this test out of those who actually have the condition (when judged by the ‘Gold Standard’).
Specificity (True Negative Rate) refers to the proportion of those who received a negative result on this test out of those who do not actually have the condition (when judged by the ‘Gold Standard’).
The area under each of these curves is the roc_auc metric we have computed.
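We can compute these two metrics from the resampled predictions as well; here “Credit”, the first factor level, is treated as the event.
# Sensitivity and specificity for the naive Bayes resample predictions.
nb_rs_predictions %>%
  sens(truth = product, estimate = .pred_class)
nb_rs_predictions %>%
  spec(truth = product, estimate = .pred_class)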
Another way to evaluate our model is to examine the confusion matrix.
The function conf_mat_resampled() computes a separate confusion matrix for each resample and then averages the cell counts.
conf_mat_resampled(nb_rs, tidy = FALSE)
## Credit Other
## Credit 2864.7 438.2
## Other 1302.1 4186.0
conf_mat_resampled(nb_rs, tidy = FALSE) %>%
autoplot(type = "heatmap")
In the figure the squares for “Credit”/“Credit” and “Other”/“Other” have a darker shade than the off-diagonal squares.
One metric alone cannot give you a complete picture of how well your classification model is performing. The confusion matrix is a good starting point to get an overview of your model performance, as it includes rich information.
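Because conf_mat_resampled(nb_rs, tidy = FALSE) returns a conf_mat object, summary() can derive a battery of related metrics (sensitivity, specificity, precision, and so on) from the averaged cell counts; a quick sketch:
# Derive additional metrics from the averaged confusion matrix.
conf_mat_resampled(nb_rs, tidy = FALSE) %>%
  summary()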
We can build a classification null_model() specification and add it to a workflow() with the same preprocessing recipe we used in the previous section, to estimate performance.
null_classification <- null_model() %>%
set_engine("parsnip") %>%
set_mode("classification")
tic("process time") ## process time is about 9 minutes
null_rs <- workflow() %>%
add_recipe(complaints_rec) %>%
add_model(null_classification) %>%
fit_resamples(
complaints_folds
)
toc()
## tic("process time") ## process time is about 1 min
## saveRDS(null_rs, "Output/regression_null_rs.rds")
## toc()
tic("process time") ## process time is about 9 min
null_rs <- readRDS("Output/classification_null_rs.rds")
toc()
## process time: 11.06 sec elapsed
null_rs %>%
collect_metrics()
The accuracy and ROC AUC indicate that this null model is dramatically worse than even our first model.
In other words, the text of the CFPB complaints is predictive of the category we are building models for.
Regularized linear models are a class of statistical model that can be used in regression and classification tasks.
Linear models are not considered cutting edge in NLP research, but are a workhorse in real-world practice.
Here we will use a lasso regularized model (Tibshirani 1996), where the regularization method also performs variable selection.
LASSO stands for Least Absolute Shrinkage and Selection Operator.
It was introduced to improve the prediction accuracy and interpretability of regression models.
The method originally appeared in the geophysics literature and was later independently developed and popularized by Robert Tibshirani, who coined the term.
In text analysis, we typically have many tokens, which are the features in our machine learning problem.
Using regularization helps us choose a simpler model that we expect to generalize better to new observations, and variable selection helps us identify which features to include in our model.
Lasso regression or classification learns how much of a penalty to put on some features (sometimes penalizing all the way down to zero) so that we can select only some features out of the high-dimensional space of original possible variables (tokens) for the final model.
Let’s create a specification of a lasso regularized model. Remember that in tidymodels, specifying a model has three components: the algorithm, the mode, and the computational engine.
Video: Robust, Interpretable Statistical Models: Sparse Regression with the LASSO
lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>% # lasso : `mixture = 1`
set_mode("classification") %>%
set_engine("glmnet")
lasso_spec
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.01
## mixture = 1
##
## Computational engine: glmnet
Then we can create another workflow() object with the lasso specification. Notice that we can reuse our text preprocessing recipe.
lasso_wf <- workflow() %>%
add_recipe(complaints_rec) %>%
add_model(lasso_spec)
lasso_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.01
## mixture = 1
##
## Computational engine: glmnet
Now we estimate the performance of this first lasso classification model with fit_resamples().
tic("process_time") # process time is about 22 minutes
set.seed(2020)
lasso_rs <- fit_resamples(
lasso_wf,
complaints_folds,
control = control_resamples(save_pred = TRUE)
)
toc()
# tic("process time") ## process time is about 1 min
# saveRDS(lasso_rs, "Output/classification_lasso_rs.rds")
# toc()
tic("process time") ## process time is about 9 sec
lasso_rs <- readRDS("Output/classification_lasso_rs.rds")
toc()
## process time: 10.69 sec elapsed
We collect the results with collect_metrics() and collect_predictions().
lasso_rs_metrics <- collect_metrics(lasso_rs)
lasso_rs_predictions <- collect_predictions(lasso_rs)
lasso_rs_metrics
This looks pretty promising, considering we haven’t yet done any tuning of the lasso hyperparameters.
The figure below shows the ROC curves for this regularized model on each of the resampled data sets.
lasso_rs_predictions %>%
group_by(id) %>%
roc_curve(truth = product, .pred_Credit) %>%
autoplot() +
labs(
color = NULL,
title = "ROC curve for US Consumer Finance Complaints",
subtitle = "Each resample fold is shown in a different color"
)
conf_mat_resampled(lasso_rs, tidy = FALSE)
## Credit Other
## Credit 3421.6 401.8
## Other 745.2 4222.4
conf_mat_resampled(lasso_rs, tidy = FALSE) %>%
autoplot(type = "heatmap")
The value penalty = 0.01 for regularization in the previous model was picked somewhat arbitrarily.
How do we know the right or best regularization parameter penalty?
This is a model hyperparameter, and we cannot learn its best value during model training, but we can estimate the best value by training many models on resampled data sets and exploring how well all these models perform.
Let’s build a new model specification for model tuning.
tune_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
set_mode("classification") %>%
set_engine("glmnet")
tune_spec
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
Think of tune() here as a placeholder for the regularization penalty.
We can create a regular grid of values to try, using a convenience function for penalty().
lambda_grid <- grid_regular(penalty(), levels = 30)
lambda_grid
The function grid_regular() is from the dials package.
It chooses sensible values to try for a parameter like the regularization penalty; here, we asked for 30 different possible values.
Let’s use tune_grid() to fit a model at each of the values for the regularization penalty in our regular grid.
In tidymodels, the package for tuning is called tune. Tuning a model uses a similar syntax to fitting a model to a set of resampled data sets for the purposes of evaluation (fit_resamples()) because the two tasks are so similar. The difference is that when you tune, each model that you fit has different parameters and you want to find the best one.
Resampling with Tuning
Let’s add tune_spec to a workflow with the same preprocessing recipe we’ve been using so far, and then fit it to every possible parameter in lambda_grid and every resample in complaints_folds with tune_grid().
tune_wf <- workflow() %>%
add_recipe(complaints_rec) %>%
add_model(tune_spec)
tic("process time") # about 22 min
set.seed(2020)
tune_rs <- tune_grid(
tune_wf,
complaints_folds,
grid = lambda_grid,
control = control_resamples(save_pred = TRUE)
)
toc()
# tic("process time") # about 1 min
# saveRDS(tune_rs, "Output/classification_tune_rs.rds")
# toc()
tic("process time") # about 10 secs
tune_rs <- readRDS("Output/classification_tune_rs.rds")
toc()
## process time: 11.4 sec elapsed
glimpse(tune_rs)
## Rows: 10
## Columns: 5
## $ splits <list> [<vfold_split[79119 x 8791 x 87910 x 18]>], [<vfold_spli~
## $ id <chr> "Fold01", "Fold02", "Fold03", "Fold04", "Fold05", "Fold06~
## $ .metrics <list> [<tbl_df[60 x 5]>], [<tbl_df[60 x 5]>], [<tbl_df[60 x 5]~
## $ .notes <list> [<tbl_df[0 x 1]>], [<tbl_df[0 x 1]>], [<tbl_df[0 x 1]>],~
## $ .predictions <list> [<tbl_df[263730 x 7]>], [<tbl_df[263730 x 7]>], [<tbl_df~
collect_metrics(tune_rs)
autoplot(tune_rs) +
labs(
title = "Lasso model performance across regularization penalties",
subtitle = "Performance metrics can be used to identity the best penalty"
)
We can view the best results with show_best() and a choice for the metric, such as ROC AUC.
tune_rs %>%
show_best("roc_auc") %>%
mutate(mean = round(mean, 4))
The best value for ROC AUC from this tuning run is 0.953.
We can extract the best regularization parameter for this value of ROC AUC from our tuning results with select_best(), or choose a simpler model with more regularization using select_by_pct_loss() or select_by_one_std_err().
Let’s choose the model with the best ROC AUC within one standard error of the numerically best model (Breiman et al. 1984).
chosen_auc <- tune_rs %>%
select_by_one_std_err(metric = "roc_auc", -penalty) %>%
mutate(mean = round(mean, 4))
chosen_auc
tune_rs %>%
select_best(metric = "roc_auc")
tune_rs %>%
select_best(metric = "accuracy")
This is the regularization penalty that our tuning results indicate gives us the best model. We can now finalize our tunable workflow with this value.
final_lasso <- finalize_workflow(tune_wf, chosen_auc)
final_lasso
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.000788046281566992
## mixture = 1
##
## Computational engine: glmnet
Instead of penalty = tune() like before, now our workflow has finalized values for all arguments.
The preprocessing recipe has been evaluated on the training data, and we tuned the regularization penalty so that we have a penalty value of 0.00079.
This workflow is ready to go! It can now be fit to our training data.
tic("process time") # process time is about 2 min
fitted_lasso <- fit(final_lasso, complaints_train)
toc()
# tic("process time") # About 34 secs
# saveRDS(fitted_lasso, "Output/classification_fitted_lasso.rds")
# toc()
tic("process time") # about 5 sec
fitted_lasso <- readRDS("Output/classification_fitted_lasso.rds")
toc()
## process time: 7.44 sec elapsed
fitted_lasso.tidy <-
fitted_lasso %>%
pull_workflow_fit() %>%
tidy() %>%
mutate(estimate = round(estimate,1)) %>%
select(-penalty)
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-2
#####
head(fitted_lasso.tidy)
The table below shows, for the penalty of 0.0007880463, the terms that contribute the most to a complaint not being about credit.
The words are largely about mortgages and other financial products.
# not about credit
fitted_lasso.tidy %>%
filter(estimate > 10) %>%
arrange(-estimate) %>%
separate(term, c(NA,NA,NA,NA, "term")) %>%
knitr::kable()
| term | estimate |
|---|---|
| funds | 27.6 |
| appraisal | 22.9 |
| escrow | 21.0 |
| bonus | 20.7 |
| debt | 18.5 |
| emailed | 16.4 |
| money | 16.1 |
| interest | 15.7 |
| afford | 15.5 |
| fees | 14.9 |
| merchant | 14.9 |
| branch | 14.6 |
| cash | 14.5 |
| collect | 14.4 |
| servicer | 13.8 |
| fee | 13.4 |
| transfer | 13.0 |
| refusing | 12.6 |
| purchases | 12.3 |
| affidavit | 12.2 |
| fdcpa | 11.9 |
| refund | 11.8 |
| thought | 11.3 |
| paypal | 11.2 |
| processing | 11.2 |
| bank | 10.9 |
| calling | 10.8 |
| title | 10.8 |
| purchased | 10.7 |
| closing | 10.5 |
| at | 10.3 |
| debit | 10.3 |
| customers | 10.2 |
| deposited | 10.2 |
The terms with large negative coefficients contribute the most to a complaint being about credit reporting.
fitted_lasso.tidy %>%
filter(estimate < -10) %>%
arrange(estimate) %>%
separate(term, c(NA,NA,NA,NA, "term")) %>%
knitr::kable()
| term | estimate |
|---|---|
| reseller | -90.9 |
| experian | -56.9 |
| transunion | -50.1 |
| equifax | -48.1 |
| compliant | -23.7 |
| reporting | -21.1 |
| freeze | -20.9 |
| inquiries | -19.0 |
| report | -18.6 |
| method | -16.3 |
| inquiry | -15.0 |
| physically | -15.0 |
| delinquency | -14.5 |
| investigation | -14.1 |
| late | -14.0 |
| accuracy | -13.9 |
| delinquent | -13.9 |
| score | -13.2 |
| scores | -12.5 |
| credit | -12.2 |
| run | -12.2 |
| investigate | -12.1 |
| accounts | -12.0 |
| items | -11.9 |
| companies | -11.4 |
| dropped | -11.3 |
| correct | -11.1 |
| addresses | -11.0 |
| corrected | -10.8 |
| creditors | -10.5 |
| compliance | -10.4 |
| record | -10.3 |
| verifiable | -10.1 |
Since we are using a linear model, the model coefficients are directly interpretable and transparently give us variable importance. Many models useful for machine learning with text do not have such transparent variable importance; in those situations, you can use other model-independent or model-agnostic approaches like permutation variable importance.
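For this linear model, a simple sketch of such an importance ranking is just the magnitude of the coefficients we already tidied above.
# Rank terms by the absolute value of their coefficient.
fitted_lasso.tidy %>%
  filter(term != "(Intercept)") %>%
  slice_max(abs(estimate), n = 10)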
We can change how our text data is represented to take advantage of its sparsity, especially for models like lasso regularized models.
The regularized regression model we have been training in previous sections used set_engine("glmnet"); this computational engine can be more efficient when text data is transformed to a sparse matrix (Section 5.1), rather than a dense data frame or tibble representation.
To keep our text data sparse throughout modeling and use the sparse capabilities of set_engine("glmnet"), we need to explicitly set a non-default preprocessing blueprint, using the package hardhat (Vaughan and Kuhn 2020).
The hardhat package is used by other tidymodels packages like recipes and parsnip under the hood. As a tidymodels user, you typically don’t use hardhat functions directly. The exception is when you need to customize something about your model or preprocessing, like in this sparse data example.
# package06
# library(hardhat)
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")
This “blueprint” lets us specify during modeling how we want our data passed around from the preprocessing into the model.
The composition “dgCMatrix” is the most common sparse matrix type, from the Matrix package (Bates and Maechler 2021), used in R for modeling.
We can use this blueprint argument when we add our recipe to our modeling workflow, to define how the data should be passed into the model.
sparse_wf <- workflow() %>%
add_recipe(complaints_rec, blueprint = sparse_bp) %>%
add_model(tune_spec)
sparse_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
The last time we tuned a lasso model, we used the defaults for the penalty parameter and 30 levels.
Let’s restrict the values this time using the range argument, so that we don’t test regularization values that are as small as before, and try only 20 levels.
smaller_lambda <- grid_regular(penalty(range = c(-5, 0)), levels = 20)
smaller_lambda %>%
mutate(penalty = format(penalty, scientific=FALSE)) %>%
knitr::kable()
| penalty |
|---|
| 0.00001000000 |
| 0.00001832981 |
| 0.00003359818 |
| 0.00006158482 |
| 0.00011288379 |
| 0.00020691381 |
| 0.00037926902 |
| 0.00069519280 |
| 0.00127427499 |
| 0.00233572147 |
| 0.00428133240 |
| 0.00784759970 |
| 0.01438449888 |
| 0.02636650899 |
| 0.04832930239 |
| 0.08858667904 |
| 0.16237767392 |
| 0.29763514416 |
| 0.54555947812 |
| 1.00000000000 |
tic("process time") # about 12 minutes
set.seed(2020)
sparse_rs <- tune_grid(
sparse_wf,
complaints_folds,
grid = smaller_lambda
)
toc()
# tic("process time") # process time about 1 min and 15 secs
# saveRDS(sparse_rs, "Output/classification_sparse_rs.rds")
# toc()
tic("process time") # process time about 11 secs
sparse_rs <- readRDS("Output/classification_sparse_rs.rds")
toc()
## process time: 10.86 sec elapsed
glimpse(sparse_rs)
## Rows: 10
## Columns: 4
## $ splits <list> [<vfold_split[79119 x 8791 x 87910 x 18]>], [<vfold_split[79~
## $ id <chr> "Fold01", "Fold02", "Fold03", "Fold04", "Fold05", "Fold06", "~
## $ .metrics <list> [<tbl_df[40 x 5]>], [<tbl_df[40 x 5]>], [<tbl_df[40 x 5]>], ~
## $ .notes <list> [<tbl_df[0 x 1]>], [<tbl_df[0 x 1]>], [<tbl_df[0 x 1]>], [<t~
set_engine("glmnet")?sparse_rs %>%
show_best("roc_auc") %>%
mutate(penalty = round(penalty, 6),
mean = round(mean, digits=4),
std_err = round(std_err, digits = 6))
The best ROC AUC is nearly identical;
the best ROC AUC for the non-sparse tuned lasso model in Section 7.4 was 0.953.
The best regularization parameter (penalty) is a little different (the best value in Section 7.4 was 0.00079), but we used a different grid so didn’t try out exactly the same values.
We ended up with nearly the same performance and best tuned model.
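A quick side-by-side makes the comparison concrete; this sketch simply stacks the single best row from each tuning result.
# Compare the best ROC AUC from the dense (tune_rs) and sparse (sparse_rs) runs.
bind_rows(
  dense  = show_best(tune_rs, "roc_auc")[1, ],
  sparse = show_best(sparse_rs, "roc_auc")[1, ],
  .id = "representation"
)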
Importantly, this tuning also took a bit less time to complete.
The preprocessing was not much faster, because tokenization and computing tf-idf take a long time.
The model fitting was much faster, because this implementation of regularized regression handles sparse matrix input much more efficiently than dense input.
Overall, the whole tuning workflow is about 10% faster using the sparse preprocessing blueprint.
Depending on how computationally expensive your preprocessing is relative to your model and how sparse your data is, you may expect to see larger (or smaller) gains from moving to a sparse data representation.
Since our model performance is about the same and we see gains in training time, let’s use this sparse representation for the rest of this chapter.
Most of this chapter focuses on binary classification, where we have two classes in our outcome variable (such as “Credit” and “Other”) and each observation can either be one or the other.
This is a simple scenario with straightforward evaluation strategies because the results only have a two-by-two contingency matrix.
However, it is not always possible to limit a modeling question to two classes.
Let’s explore how to deal with situations where we have more than two classes.
The CFPB complaints data set in this chapter has nine different product classes.
In decreasing frequency, they are:
Credit reporting, credit repair services, or other personal consumer reports
Debt collection
Credit card or prepaid card
Mortgage
Checking or savings account
Student loan
Vehicle loan or lease
Money transfer, virtual currency, or money service
Payday loan, title loan, or personal loan
We assume that there is a reason why these product classes have been created in this fashion by this government agency.
Perhaps complaints from different classes are handled by different people or organizations.
Whatever the reason, in this section we would like to build a multiclass classifier to identify these nine specific product classes.
We need to create a new split of the data using initial_split() on the unmodified complaints data set.
set.seed(1234)
multicomplaints_split <- initial_split(complaints, strata = product)
dim(multicomplaints_split)
## analysis assessment n p
## 87910 29304 117214 18
multicomplaints_train <- training(multicomplaints_split)
multicomplaints_test <- testing(multicomplaints_split)
multicomplaints_train %>%
count(product, sort = TRUE) %>%
select(n, product) %>%
knitr::kable()
| n | product |
|---|---|
| 41678 | Credit reporting, credit repair services, or other personal consumer reports |
| 16688 | Debt collection |
| 8689 | Credit card or prepaid card |
| 7117 | Mortgage |
| 5219 | Checking or savings account |
| 2927 | Student loan |
| 2030 | Vehicle loan or lease |
| 1906 | Money transfer, virtual currency, or money service |
| 1656 | Payday loan, title loan, or personal loan |
When you have multiple classes in your data, it is possible to formulate the multiclass problem in two ways.
With one approach, any given observation can belong to multiple classes.
With the other approach, an observation can belong to one and only one class.
We will be sticking to the second, “one class per observation” model formulation in this section.
There are many different ways to deal with imbalanced data.
We will demonstrate one of the simplest methods, downsampling, where observations from the majority classes are removed during training to achieve a balanced class distribution.
We will be using the themis (Hvitfeldt 2020d) add-on package for recipes which provides the step_downsample() function to perform downsampling.
The themis package provides many more algorithms to deal with imbalanced data during data preprocessing.
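To see what downsampling does to the class counts before building the full recipe, here is a minimal sketch using only step_downsample() (no text preprocessing); each majority class is reduced to the size of the smallest class.
# Sketch: effect of step_downsample() on the training class counts.
recipe(product ~ consumer_complaint_narrative,
       data = multicomplaints_train) %>%
  step_downsample(product) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  count(product, sort = TRUE)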
We have to create a new recipe specification from scratch, since we are dealing with new training data this time.
The specification multicomplaints_rec is similar to what we created in Section 7.1.
The only changes are that different data is passed to the data argument in the recipe() function (it is now multicomplaints_train) and we have added step_downsample(product) to the end of the recipe specification to downsample after all the text preprocessing.
We want to downsample last so that we still generate features on the full training data set.
The downsampling will then only affect the modeling step, not the preprocessing steps, with hopefully better results.
# package07
# library(themis)
multicomplaints_rec <-
recipe(product ~ consumer_complaint_narrative,
data = multicomplaints_train) %>%
step_tokenize(consumer_complaint_narrative) %>%
step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
step_tfidf(consumer_complaint_narrative) %>%
step_downsample(product)
We also need a new cross-validation object since we are using a different data set.
multicomplaints_folds <- vfold_cv(multicomplaints_train)
We cannot reuse the tuneable lasso classification specification from Section 7.4 because it only works for binary classification.
Some model algorithms and computational engines (such as most random forests and SVMs) automatically detect when we perform multiclass classification from the number of classes in the outcome variable and do not require any changes to our model specification.
For lasso regularization, we need to create a new special model specification just for the multiclass case, using multinom_reg().
multi_spec <- multinom_reg(penalty = tune(), mixture = 1) %>%
set_mode("classification") %>%
set_engine("glmnet")
multi_spec
## Multinomial Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
We used the same arguments for penalty and mixture as in Section 7.4, as well as the same mode and engine, but this model specification is set up to handle more than just two classes.
We can combine this model specification with our preprocessing recipe for multiclass data in a workflow().
multi_lasso_wf <- workflow() %>%
add_recipe(multicomplaints_rec, blueprint = sparse_bp) %>%
add_model(multi_spec)
multi_lasso_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: multinom_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
## * step_downsample()
##
## -- Model -----------------------------------------------------------------------
## Multinomial Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
Now we have everything we need to tune the regularization penalty and find an appropriate value.
Note that we specify save_pred = TRUE, so we can create ROC curves and a confusion matrix later. This is especially beneficial for multiclass classification.
tic("process time") # About 1 hr
multi_lasso_rs <- tune_grid(
multi_lasso_wf,
multicomplaints_folds,
grid = smaller_lambda,
control = control_resamples(save_pred = TRUE)
)
toc()
# tic("process time") # About 1 min and 7 secs
# saveRDS(multi_lasso_rs, "Output/classification_multi_lasso_rs.rds")
# toc()
tic("process time") # About 10 secs
multi_lasso_rs <- readRDS("Output/classification_multi_lasso_rs.rds")
toc()
## process time: 13.02 sec elapsed
glimpse(multi_lasso_rs)
## Rows: 10
## Columns: 5
## $ splits <list> [<vfold_split[79119 x 8791 x 87910 x 18]>], [<vfold_spli~
## $ id <chr> "Fold01", "Fold02", "Fold03", "Fold04", "Fold05", "Fold06~
## $ .metrics <list> [<tbl_df[40 x 5]>], [<tbl_df[40 x 5]>], [<tbl_df[40 x 5]~
## $ .notes <list> [<tbl_df[1 x 1]>], [<tbl_df[0 x 1]>], [<tbl_df[1 x 1]>],~
## $ .predictions <list> [<tbl_df[175820 x 14]>], [<tbl_df[175820 x 14]>], [<tbl_~
best_acc <- multi_lasso_rs %>%
show_best("accuracy")
best_acc %>%
mutate(penalty = round(penalty,5),
mean = round(mean, 3),
std_err = round(std_err, 5))
In binary classification, there is one right answer and one wrong answer; in this multiclass case, there is one right answer and eight wrong answers.
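The yardstick metrics extend to the multiclass setting; for example, sens() macro-averages sensitivity across the nine classes by default. A minimal sketch using the resampled predictions at the best accuracy penalty:
# Macro-averaged sensitivity across the nine product classes.
multi_lasso_rs %>%
  collect_predictions() %>%
  filter(penalty == best_acc$penalty[1]) %>%
  sens(truth = product, estimate = .pred_class)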
tic("process time") About 20 seconds
conf_mat_ml_rs <-
multi_lasso_rs %>%
collect_predictions() %>%
filter(penalty == best_acc$penalty) %>%
filter(id == "Fold01") %>%
conf_mat(product, .pred_class)
toc()
# saveRDS(conf_mat_ml_rs, "Output/classification_conf_mat_ml_rs.rds")
conf_mat_ml_rs <- readRDS("Output/classification_conf_mat_ml_rs.rds")
png(filename = "Images/classification_conf_mat_ml_rs.png",
height = 800, width = 1500, res = 144)
conf_mat_ml_rs %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
dev.off()
## png
## 2
The diagonal is fairly well populated, which is a good sign.
This means that the model generally predicted the right class.
The off-diagonal numbers are all the failures and where we should direct our focus.
It is a little hard to see these cases well since the majority class affects the scale.
A trick to deal with this problem is to remove all the correctly predicted observations.
tic("process time")
conf_mat_ml_rs2 <-
multi_lasso_rs %>%
collect_predictions() %>%
filter(penalty == best_acc$penalty) %>%
filter(id == "Fold01") %>%
filter(.pred_class != product) %>%
conf_mat(product, .pred_class)
toc()
# saveRDS(conf_mat_ml_rs2, "Output/classification_conf_mat_ml_rs2.rds")
conf_mat_ml_rs2 <- readRDS("Output/classification_conf_mat_ml_rs2.rds")
png("Images/classification_conf_mat_ml_rs2.png",
height = 800, width = 1500, res = 144)
conf_mat_ml_rs2 %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
dev.off()
## png
## 2
Now we can more clearly see where our model breaks down in the figure above.
Some of the most common errors are “Credit reporting, credit repair services, or other personal consumer reports” complaints being wrongly predicted as “Debt collection” or “Credit card or prepaid card” complaints.
Those mistakes by the model are not hard to understand since all deal with credit and debt and do have overlap in vocabulary.
Knowing what the problem is helps us figure out how to improve our model.
The next step for improving our model is to revisit the data preprocessing steps and model selection.
We can look at different models or model engines that might be able to more easily separate the classes.
Now that we have an idea of where the model isn’t working, we can look more closely at the data and attempt to create features that could distinguish between these classes.
In Section 7.9 we will demonstrate how you can create your own custom features.
We are building a model from a data set that includes more than text data alone.
Annotations and labels have been added by the CFPB that we can use during modeling, but we need to ensure that only information that would be available at the time of prediction is included in the model.
Otherwise we will be very disappointed once our model is used to predict on new data!
The variables we identify as available for use as predictors are:
date_received
issue
sub_issue
consumer_complaint_narrative
company
state
zip_code
tags
submitted_via
Let’s try including date_received in our modeling, along with the text variable we have already used, consumer_complaint_narrative, and a new variable tags.
The submitted_via variable could have been a viable candidate, but all the entries are “Web.”
The other variables like ZIP code could be of use too, but they are categorical variables with many values so we will exclude them for now.
more_vars_rec <-
recipe(product ~ date_received + tags + consumer_complaint_narrative,
data = complaints_train)
We can use the step_date() function to extract the month and day of the week (“dow”) from date_received. Then the day of the week and month can be converted to indicator variables with step_dummy().
Categorical variables like the month can be stored as strings or factors, but for some kinds of models, they must be converted to indicator or dummy variables. These are numeric binary variables for the levels of the original categorical variable. For example, a variable called December would be created that is all zeroes and ones specifying which complaints were submitted in December, plus a variable called November, a variable called October, and so on.
glimpse(complaints_train)
## Rows: 87,910
## Columns: 18
## $ date_received <date> 2019-11-20, 2019-04-17, 2019-10-24, 2019~
## $ product <fct> Credit, Credit, Credit, Credit, Credit, C~
## $ sub_product <chr> "Credit reporting", "Credit reporting", "~
## $ issue <chr> "Incorrect information on your report", "~
## $ sub_issue <chr> "Account information incorrect", "Their i~
## $ consumer_complaint_narrative <chr> "I would like the credit bureau to correc~
## $ company_public_response <chr> "Company has responded to the consumer an~
## $ company <chr> "TRANSUNION INTERMEDIATE HOLDINGS, INC.",~
## $ state <chr> "TX", "NC", "NC", "CO", "WI", "CA", "FL",~
## $ zip_code <chr> "77004", "285XX", "276XX", "80920", NA, N~
## $ tags <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ consumer_consent_provided <chr> "Consent provided", "Consent provided", "~
## $ submitted_via <chr> "Web", "Web", "Web", "Web", "Web", "Web",~
## $ date_sent_to_company <date> 2019-11-20, 2019-04-17, 2019-10-24, 2019~
## $ company_response_to_consumer <chr> "Closed with explanation", "Closed with e~
## $ timely_response <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",~
## $ consumer_disputed <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A",~
## $ complaint_id <dbl> 3444592, 3214857, 3417374, 3420108, 33801~
more_vars_rec <- more_vars_rec %>%
step_date(date_received, features = c("month", "dow"), role = "dates") %>%
step_rm(date_received) %>%
step_dummy(has_role("dates"))
The tags variable has some missing data.
We can deal with this by using step_unknown(), which adds a new level to this factor variable for cases of missing data.
Then we “dummify” (create dummy/indicator variables) the variable with step_dummy().
sum(is.na(complaints_train$tags))
## [1] 73192
more_vars_rec <- more_vars_rec %>%
step_unknown(tags) %>%
step_dummy(tags)
more_vars_rec <- more_vars_rec %>%
step_tokenize(consumer_complaint_narrative) %>%
step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
step_tfidf(consumer_complaint_narrative)
more_vars_wf <- workflow() %>%
add_recipe(more_vars_rec, blueprint = sparse_bp) %>%
add_model(tune_spec)
more_vars_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 8 Recipe Steps
##
## * step_date()
## * step_rm()
## * step_dummy()
## * step_unknown()
## * step_dummy()
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
Let’s tune this workflow() with our resampled data sets, find a good value for the regularization penalty, and estimate the model’s performance.
tic("process time") # About 11.5 min
set.seed(123)
more_vars_rs <- tune_grid(
more_vars_wf,
complaints_folds,
grid = smaller_lambda
)
toc()
# tic("process time") # About 1 min
# saveRDS(more_vars_rs,
# "Output/classification_more_vars_rs.rds")
# toc()
tic("process time") # About 8.6 sec
more_vars_rs <- readRDS("Output/classification_more_vars_rs.rds")
toc()
## process time: 10.78 sec elapsed
We can extract the metrics for the best-performing regularization penalties from these results with show_best() with an option like “roc_auc” or “accuracy” if we prefer.
How did our chosen performance metric turn out for our model that included more than just the text data?
more_vars_rs %>%
show_best("roc_auc") %>%
mutate(penalty = round(penalty,5),
mean = round(mean, 4),
std_err = round(std_err,5))
We see here that including more predictors did not measurably improve our model performance or even change the regularization. With only text features in Section 7.5 and the same grid and sparse encoding, we achieved an ROC AUC of 0.953, the same as what we see now by including the features dealing with dates and tags as well.
The best regularization penalty in Section 7.5 was 0.0007 and is about the same here. We can use tidy() and some dplyr manipulation to find at what rank (term_rank) any of the date or tag variables were included in the regularized results, by absolute value of the model coefficient.
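That final fitted model is not shown above, so the following is only a sketch of the idea: finalize more_vars_wf with the best penalty, fit it to the training data, and rank every term by the absolute value of its coefficient (the object names more_vars_final and more_vars_fit are illustrative, not created earlier).
# Sketch (illustrative object names): find the rank of the date and tag
# features among all regularized coefficients.
more_vars_final <- finalize_workflow(
  more_vars_wf,
  select_best(more_vars_rs, metric = "roc_auc")
)
more_vars_fit <- fit(more_vars_final, data = complaints_train)
more_vars_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(desc(abs(estimate))) %>%
  mutate(term_rank = row_number()) %>%
  filter(!str_detect(term, "tfidf"))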