Modeling Data in the Tidyverse Course Project

Author: Francisco Pablo Huascar Aragão Pinheiro
Affiliation: Universidade Federal do Ceará
Course: Tidyverse Skills for Data Science in R (Coursera)

In this course, we have learned about modeling data in the Tidyverse in R. This project will give you the opportunity to practice those skills in greater depth.

Here, we will continue to use consumer complaints data from the Consumer Financial Protection Bureau (CFPB) Consumer Complaint Database, which was used in the Wrangling Data in the Tidyverse Course Project. The CFPB is an independent agency of the United States government that promotes transparency and protects consumers by providing the information needed to make decisions when choosing financial institutions, including banks, lenders, mortgage servicers, credit unions, securities firms, foreclosure services, and debt collectors. One of the purposes of the agency is to receive and process complaints and questions about consumer financial products and services.

When a complaint is submitted by a consumer, the CFPB has to determine which category the complaint falls into (e.g., “Mortgage”, “Student loan”). In this project, your goal is to use the skills you have learned in this course to build a classification algorithm that assigns consumer complaints to one of four categories: “Credit card or prepaid card”, “Mortgage”, “Student loan”, or “Vehicle loan or lease”.

Data

In this assignment, two datasets have been prepared for you:

Training data – complaints from the CFPB (data_complaints_train.csv) [134Mb]

Testing data – complaints from the CFPB (data_complaints_test.csv) [37Kb]

The goal of your project is to predict the consumer complaint category. This is the “Product” variable in the training data set. You may use any of the other variables as predictors. You should create a report describing how you built your model, how you used cross-validation, what you think the expected out-of-sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

“Text Mining with R: A Tidy Approach” (Silge and Robinson 2019) and “Supervised Machine Learning for Text Analysis in R” (Hvitfeldt and Silge 2021) are excellent guides to learning how to solve this task.

For the exploratory part, I recommend reading chapters 1 and 3 of the first book. I suggest chapter 7 of the second book to learn how to build the classification model.

Packages Used

library(tidyverse)    # data manipulation and visualization
library(janitor)      # cleaning variable names
library(tidytext)     # tidy text mining (unnest_tokens, stop_words, bind_tf_idf)
library(textrecipes)  # text pre-processing steps for recipes
library(tidymodels)   # modeling framework (recipes, parsnip, rsample, tune, ...)
library(hardhat)      # low-level modeling infrastructure
library(themis)       # recipe steps for class imbalance (step_downsample)
library(glmnet)       # regularized regression engine

Getting the data

The links available on the Coursera website are updated periodically, so the following signed URLs may have expired by the time you read this.

if(!file.exists("./data")){dir.create("./data")}

url_train <- "https://d3c33hcgiwev3.cloudfront.net/JhHJz2SSRCqRyc9kkgQqxA_8d34147955154de4a6176086946d07b3_data_complaints_train.csv?Expires=1682640000&Signature=JyDQC3nZ6bqLT8ucu9q-4NPNatr2GrHMRsxi49IxE131SJXBWpqyfYEUwZ2ggQ48ZpbR4JXYIxFg2QWBv7ku-6aO8fz7Oh9FV8cn306XwCEGflwq8~ih7cH6gXYTNcHBiMqNy72DnVmNxmpi5ubSe3Vjcf7kQxvPyG12kDYbWFY_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A"
  

destfile_train <- 
  destfile <- paste("./data/train.csv")


download.file(url_train,
              destfile_train)
url_test <- "https://d3c33hcgiwev3.cloudfront.net/aEBWUxehSGyAVlMXoThsoQ_edf53641edca416fa00a78d9e4b16ced_data_complaints_test.csv?Expires=1682380800&Signature=SuzIXp8RiKlAYoIG6bQDgvmNOFRZSHSgIqophzbhFQSqXd3ocYi~AOn3YbNVuHyDEI8WTI5jPnYKYvRL~mTcEWmeszALTXZUi8qKfHXbC67iwScmn9Lmam4axOXpLOxlaYjFJolVZd1x2MpP8IEJt1hGY~A5~ZWc6T6aFVs7KkI_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A"
  

destfile_test <- 
  destfile <- paste("./data/test.csv")

download.file(url_test,
              destfile_test)

To facilitate data manipulation, we used the janitor package to clean up variable names, and we also renamed the consumer_complaint_narrative variable to complaint.

df_train <-
  read_csv("./data/train.csv") |> 
  clean_names() |> 
  select(product,
         consumer_complaint_narrative) |> 
  rename("complaint" = consumer_complaint_narrative)  

An initial look at the data shows that we have four product categories: Credit card or prepaid card, Mortgage, Student loan, and Vehicle loan or lease.

df_train |> 
  distinct(product)
# A tibble: 4 × 1
  product                    
  <chr>                      
1 Credit card or prepaid card
2 Mortgage                   
3 Student loan               
4 Vehicle loan or lease      

In the variable that contains the complaints, strings like “XX/XX” or “XXXX XXXX XXXX XXXX” are used to protect personally identifiable information.

Monetary values are surrounded by curly brackets (e.g., “{$10.00}”).

df_train |> 
  select(complaint) |> 
  slice_head(n = 20)
# A tibble: 20 × 1
   complaint                                                                    
   <chr>                                                                        
 1 "I initially in writing to Chase Bank in late XXXX about a charge that was u…
 2 "My ex husband and myself had a mobile home ( Home Mortgage ) with XXXX XXXX…
 3 "I was a student at XXXX XXXX from XX/XX/XXXX-XX/XX/XXXX. I accumulated XXXX…
 4 "It has come to my attention the CITI group is actively attempting to interf…
 5 "This banks new firearm policies run counter to laws and regulations passed …
 6 "I only use my walmart store card to keep from having it cancelled. My sons …
 7 "Complaint : Charged {$10.00} Finance fee for making a principal Payment Acc…
 8 "CALLED AND DEMNDED PAYMENTS BEFORE THEY WERE DUE BY 15 DAYS. FALSE CREDIT L…
 9 "In XX/XX/2019, I refinanced my office building loan with US Bank. At the ti…
10 "I began a vehicle loan with Credit Acceptance aroundXX/XX/XXXX. I have neve…
11 "Loan Number:XXXX  Bank of America Home Loans From : XXXX XXXX   XXXX XXXX X…
12 "I provide XXXX XXXX  and get paid via credit card. Before I accepted the pa…
13 "My student loans are currently being services by Pennsylvania Higher Educat…
14 "I called card.com to find out where my money went because I checked my acco…
15 "About 6 years ago ... due to the fact that I am at poverty level ... I shop…
16 "Due to a late payment in XX/XX/XXXX ( the only late payment I have had in r…
17 "On XX/XX/XXXX I attempted to use my XXXX XXXX ( Social Security XXXX   bene…
18 "I had a Merrill Lynch credit card issued by Bank of America closed without …
19 "Everytime I get ready to use the card it works then after one purchase it d…
20 "This is the letter we just mailed to ALLY Financial. \r\n\r\nAs I am typing…

Exploratory Data Analysis - EDA

For the EDA, we will clean the complaint variable by removing dates ("..\/..\/...."), monetary values ("\{\$[0-9\.]*\}"), censored personal data such as credit card numbers ("X(X+)"), numbers ("[0-9]+"), and other artifacts.

complaint_clean <-
  df_train |> 
  drop_na(complaint) |>
  mutate(complaint = str_remove_all(complaint, "..\\/..\\/...."),     # dates, e.g. XX/XX/2019
         complaint = str_remove_all(complaint, "\\{\\$[0-9\\.]*\\}"), # monetary values, e.g. {$10.00}
         complaint = str_remove_all(complaint, "X(X+)"),              # censored data (XXXX ...)
         complaint = str_remove_all(complaint, "x(x+)"),              # lowercase censored data
         complaint = str_remove_all(complaint, "[0-9]+"),             # numbers
         complaint = str_remove_all(complaint, "\n_*\n"),             # blank or underscore-only lines
         complaint = str_remove_all(complaint, "\\b\\w{13,}\\b"),     # very long "words" (13+ characters)
         complaint = str_remove_all(complaint, "(_+\\s?)+"))          # runs of underscores

At first glance, the most frequent words are uninformative (“the”, “to”, “and”).

complaint_words <- 
  complaint_clean |>
  unnest_tokens(word, complaint) |>
  count(product, word)

complaint_words |> 
  arrange(desc(n))
# A tibble: 106,083 × 3
   product                     word       n
   <chr>                       <chr>  <int>
 1 Mortgage                    the   441634
 2 Credit card or prepaid card the   399102
 3 Credit card or prepaid card i     369168
 4 Mortgage                    to    358105
 5 Mortgage                    i     304823
 6 Credit card or prepaid card to    303243
 7 Mortgage                    and   274435
 8 Credit card or prepaid card and   244365
 9 Mortgage                    a     190193
10 Credit card or prepaid card my    184536
# ℹ 106,073 more rows

We will use the stop_words dataset with the anti_join function to remove these uninformative words. The words that now appear most frequently are clearly related to the complaints (“card”, “credit”, “loan”, etc.).

complaint_words_clean <-
  complaint_words |> 
  anti_join(stop_words)
Joining with `by = join_by(word)`
complaint_words_clean |> 
  arrange(desc(n))
# A tibble: 103,290 × 3
   product                     word         n
   <chr>                       <chr>    <int>
 1 Credit card or prepaid card card     91834
 2 Credit card or prepaid card credit   86714
 3 Mortgage                    mortgage 72805
 4 Mortgage                    loan     72284
 5 Credit card or prepaid card account  69192
 6 Mortgage                    payment  56609
 7 Credit card or prepaid card payment  34077
 8 Credit card or prepaid card bank     32022
 9 Mortgage                    payments 31399
10 Student loan                loan     31018
# ℹ 103,280 more rows

The graph displays the 15 most frequent words for each product type and provides useful context about the complaints.

complaint_words_clean |>
  group_by(product) |> 
  slice_max(n, n = 15) |>  
  ungroup() |> 
  ggplot(
    aes(n,fct_reorder(word, n),
        fill = product)
  ) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ product, 
             ncol = 2, 
             scales = "free") +
  labs(x = "n", 
       y = NULL)

However, raw word frequency is not the best way to characterize a text. Let’s use TF-IDF instead.

TF-IDF is a natural language processing technique that aims to measure the importance of each word in a text. The acronym stands for “Term Frequency-Inverse Document Frequency”. The idea is that words that appear frequently in a particular text but rarely in others are more important for the meaning of that text.
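
Concretely, following tidytext’s conventions (the idf uses the natural logarithm):

$$
\text{tf-idf}(t, d) = \underbrace{\frac{n_{t,d}}{\sum_{t'} n_{t',d}}}_{\text{tf}} \times \underbrace{\ln\frac{N}{\left|\{d : t \in d\}\right|}}_{\text{idf}}
$$

Here each “document” d is a product category, so N = 4: an idf of ln(2) ≈ 0.693 means a word appears in two of the four categories, while ln(4) ≈ 1.39 means it appears in only one.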

Below, we apply this technique with tidytext’s bind_tf_idf function, treating each product category as a “document”: it computes each word’s frequency within a category and adjusts it by how rarely the word occurs across all categories. This helps identify the terms most characteristic of each type of complaint.

complaint_tf_idf <- 
  complaint_words |> 
  bind_tf_idf(word,
              product, 
              n)

complaint_tf_idf |> 
  arrange(desc(tf_idf))
# A tibble: 106,083 × 6
   product                     word           n       tf   idf   tf_idf
   <chr>                       <chr>      <int>    <dbl> <dbl>    <dbl>
 1 Student loan                navient    13464 0.00473  0.693 0.00328 
 2 Student loan                fedloan     1774 0.000623 1.39  0.000863
 3 Credit card or prepaid card amex        4421 0.000528 1.39  0.000732
 4 Mortgage                    phh         3778 0.000410 1.39  0.000568
 5 Student loan                pslf        2307 0.000810 0.693 0.000561
 6 Student loan                nelnet      2273 0.000798 0.693 0.000553
 7 Mortgage                    shellpoint  3564 0.000387 1.39  0.000536
 8 Vehicle loan or lease       gm          1612 0.000743 0.693 0.000515
 9 Mortgage                    loancare    3371 0.000366 1.39  0.000507
10 Mortgage                    sps         3334 0.000362 1.39  0.000502
# ℹ 106,073 more rows

The following graph shows words that appear to be very relevant for identifying the complaints. For example, several car brands appear in the Vehicle loan or lease category.

complaint_tf_idf  |> 
  group_by(product)  |> 
  slice_max(tf_idf, n = 15) |> 
  ungroup()  |> 
  ggplot(
    aes(tf_idf, 
        fct_reorder(word, tf_idf), 
        fill = product)
    ) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ product, 
             ncol = 2, 
             scales = "free") +
  labs(x = "tf-idf", 
       y = NULL)

Building a classification algorithm

The data are imbalanced: there are roughly four times as many complaints about Credit card or prepaid card as about Vehicle loan or lease.

df_train |> 
  count(product)
# A tibble: 4 × 2
  product                         n
  <chr>                       <int>
1 Credit card or prepaid card 38294
2 Mortgage                    30957
3 Student loan                12485
4 Vehicle loan or lease        9239

We will perform downsampling (step_downsample()) to deal with the imbalance between the product classes.

We defined a recipe for text pre-processing and feature engineering on the complaint variable, which will be used to predict the product category. The recipe includes tokenization, filtering out infrequent tokens, weighting the tokens by term frequency-inverse document frequency (tf-idf), and downsampling the majority classes. The resulting recipe will be used in the downstream modeling steps.

complaints_recipe <-
  recipe(product ~ complaint,
         data = df_train) |> 
  step_tokenize(complaint) |> 
  step_tokenfilter(complaint,
                   max_tokens = 1e3) |>
  step_tfidf(complaint) |> 
  step_downsample(product)

complaints_recipe
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:   1
predictor: 1
── Operations 
• Tokenization for: complaint
• Text filtering for: complaint
• Term frequency-inverse document frequency with: complaint
• Down-sampling based on: product
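
To see what the recipe produces, we can prep() and bake() it on a small sample (a quick inspection sketch only; the drop_na() call and the sample size of 100 are arbitrary choices, not part of the modeling pipeline):

set.seed(42)

complaints_recipe |> 
  prep(training = df_train |> drop_na(complaint) |> slice_sample(n = 100)) |> 
  bake(new_data = NULL) |> 
  glimpse()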

We defined a multinomial logistic regression model using the multinom_reg function from the parsnip package. The regularization penalty is set to 0.01, and the mixture is set to 1 (a pure lasso penalty). The model is set to perform classification and uses the glmnet engine.
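
For reference, with mixture = 1 the glmnet engine minimizes the lasso-penalized multinomial negative log-likelihood (a sketch of the objective, following the glmnet documentation):

$$
\min_{\{\beta_{0k},\, \beta_k\}_{k=1}^{K}} \; -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\beta_{0, y_i} + x_i^\top \beta_{y_i}}}{\sum_{k=1}^{K} e^{\beta_{0k} + x_i^\top \beta_k}} + \lambda \sum_{k=1}^{K} \lVert \beta_k \rVert_1
$$

where K = 4 product categories, the x_i are the tf-idf features, and λ is the penalty (fixed at 0.01 here, tuned below).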

multi_lasso_spec <- 
  multinom_reg(penalty = 0.01, 
               mixture = 1)  |> 
  set_mode("classification")  |> 
  set_engine("glmnet")

multi_lasso_spec
Multinomial Regression Model Specification (classification)

Main Arguments:
  penalty = 0.01
  mixture = 1

Computational engine: glmnet 

We created a workflow that includes the recipe for processing the text data and the multinomial logistic regression model with L1 regularization specified earlier. The workflow allows us to fit the model and apply it to new data using the same pre-processing steps.

multi_lasso_wf <- 
  workflow()  |> 
  add_recipe(complaints_recipe)  |> 
  add_model(multi_lasso_spec)

multi_lasso_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: multinom_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_tokenize()
• step_tokenfilter()
• step_tfidf()
• step_downsample()

── Model ───────────────────────────────────────────────────────────────────────
Multinomial Regression Model Specification (classification)

Main Arguments:
  penalty = 0.01
  mixture = 1

Computational engine: glmnet 

This code creates 10-fold cross-validation folds (the vfold_cv default) for the complaints data using the rsample package. The resulting complaints_folds object contains a training and validation set for each fold, which can be used to train and evaluate models.

set.seed(123)

complaints_folds <- 
  vfold_cv(df_train)

complaints_folds
#  10-fold cross-validation 
# A tibble: 10 × 2
   splits               id    
   <list>               <chr> 
 1 <split [81877/9098]> Fold01
 2 <split [81877/9098]> Fold02
 3 <split [81877/9098]> Fold03
 4 <split [81877/9098]> Fold04
 5 <split [81877/9098]> Fold05
 6 <split [81878/9097]> Fold06
 7 <split [81878/9097]> Fold07
 8 <split [81878/9097]> Fold08
 9 <split [81878/9097]> Fold09
10 <split [81878/9097]> Fold10

We set the seed for reproducibility and register a parallel backend using the doParallel package. Then, we fit the multi_lasso_wf workflow using the fit_resamples() function, which performs 10-fold cross-validation on complaints_folds and saves the predictions for each fold via control_resamples(save_pred = TRUE).

set.seed(678)

doParallel::registerDoParallel()

multi_lasso_rs <- 
  fit_resamples(
  multi_lasso_wf,
  complaints_folds,
  control = control_resamples(save_pred = TRUE)
)

These lines collect the performance metrics and predicted values from the resampling of the lasso-penalized multinomial logistic regression model (multi_lasso_wf). The collect_metrics function returns the metrics averaged across folds (here, accuracy and ROC AUC), while collect_predictions returns the predicted classes and true labels from each fold’s held-out set, which can be used to compute additional performance metrics or to create visualizations.

m_lasso_rs_metrics <- 
  collect_metrics(multi_lasso_rs)

m_lasso_rs_predictions <- 
  collect_predictions(multi_lasso_rs)

The results indicate that the model has a high accuracy, with a mean value of 0.923 and a low standard error. Regarding the roc_auc metric, the model also performs well with a mean value of 0.987, indicating that it has good discriminatory power. These results suggest that the model is well-suited for the task and can accurately classify consumer complaints. However, it is important to consider the potential impact of imbalanced data on the model’s performance and whether any further optimization or evaluation of the model is necessary.

m_lasso_rs_metrics
# A tibble: 2 × 6
  .metric  .estimator  mean     n  std_err .config             
  <chr>    <chr>      <dbl> <int>    <dbl> <chr>               
1 accuracy multiclass 0.923    10 0.000681 Preprocessor1_Model1
2 roc_auc  hand_till  0.987    10 0.000434 Preprocessor1_Model1

The code creates a ROC curve plot to evaluate how well the multi-class lasso model distinguishes the different financial products and services. The m_lasso_rs_predictions object is piped into the group_by function to group the data by fold id. Then, the roc_curve function calculates the true positive rate and false positive rate for each fold and each class.

m_lasso_rs_predictions  |> 
  group_by(id)  |> 
  roc_curve(truth = product, c(".pred_Credit card or prepaid card",
                               ".pred_Mortgage",
                               ".pred_Student loan",
                               ".pred_Vehicle loan or lease"))  |> 
  autoplot() +
  labs(
    color = NULL,
    title = "ROC curve for financial products and services",
    subtitle = "Each resample fold is shown in a different color"
  )
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
ℹ The deprecated feature was likely used in the yardstick package.
  Please report the issue at <https://github.com/tidymodels/yardstick/issues>.
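
To see which categories the model confuses most often, we can also plot the confusion matrix averaged across the resamples: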

conf_mat_resampled(multi_lasso_rs, 
                   tidy = FALSE) |> 
  autoplot(type = "heatmap")

Tuning lasso hyperparameters

The following chunks tune the regularization penalty for the lasso model and fit the finalized model to the training data.

tune_spec <- 
  multinom_reg(penalty = tune(), 
               mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

tune_spec
Multinomial Regression Model Specification (classification)

Main Arguments:
  penalty = tune()
  mixture = 1

Computational engine: glmnet 
lambda_grid <- grid_regular(penalty(), levels = 30)
lambda_grid
# A tibble: 30 × 1
    penalty
      <dbl>
 1 1   e-10
 2 2.21e-10
 3 4.89e-10
 4 1.08e- 9
 5 2.40e- 9
 6 5.30e- 9
 7 1.17e- 8
 8 2.59e- 8
 9 5.74e- 8
10 1.27e- 7
# ℹ 20 more rows
tune_wf <- 
  workflow()  |> 
  add_recipe(complaints_recipe)  |> 
  add_model(tune_spec)

tune_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: multinom_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_tokenize()
• step_tokenfilter()
• step_tfidf()
• step_downsample()

── Model ───────────────────────────────────────────────────────────────────────
Multinomial Regression Model Specification (classification)

Main Arguments:
  penalty = tune()
  mixture = 1

Computational engine: glmnet 
doParallel::registerDoParallel()
set.seed(2023)

tune_rs <- 
  tune_grid(
  tune_wf,
  complaints_folds,
  grid = lambda_grid,
  control = control_resamples(save_pred = TRUE)
)
collect_metrics(tune_rs)
# A tibble: 60 × 7
    penalty .metric  .estimator  mean     n  std_err .config              
      <dbl> <chr>    <chr>      <dbl> <int>    <dbl> <chr>                
 1 1   e-10 accuracy multiclass 0.923    10 0.000766 Preprocessor1_Model01
 2 1   e-10 roc_auc  hand_till  0.989    10 0.000311 Preprocessor1_Model01
 3 2.21e-10 accuracy multiclass 0.923    10 0.000766 Preprocessor1_Model02
 4 2.21e-10 roc_auc  hand_till  0.989    10 0.000311 Preprocessor1_Model02
 5 4.89e-10 accuracy multiclass 0.923    10 0.000766 Preprocessor1_Model03
 6 4.89e-10 roc_auc  hand_till  0.989    10 0.000311 Preprocessor1_Model03
 7 1.08e- 9 accuracy multiclass 0.923    10 0.000766 Preprocessor1_Model04
 8 1.08e- 9 roc_auc  hand_till  0.989    10 0.000311 Preprocessor1_Model04
 9 2.40e- 9 accuracy multiclass 0.923    10 0.000766 Preprocessor1_Model05
10 2.40e- 9 roc_auc  hand_till  0.989    10 0.000311 Preprocessor1_Model05
# ℹ 50 more rows
autoplot(tune_rs) +
  labs(
    title = "Lasso model performance across regularization penalties",
    subtitle = "Performance metrics can be used to identity the best penalty"
  )
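
show_best() lists the penalty values with the highest mean ROC AUC: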

tune_rs |> 
  show_best(metric = "roc_auc")
# A tibble: 5 × 7
   penalty .metric .estimator  mean     n  std_err .config              
     <dbl> <chr>   <chr>      <dbl> <int>    <dbl> <chr>                
1 0.000788 roc_auc hand_till  0.991    10 0.000321 Preprocessor1_Model21
2 0.00174  roc_auc hand_till  0.991    10 0.000348 Preprocessor1_Model22
3 0.000356 roc_auc hand_till  0.991    10 0.000304 Preprocessor1_Model20
4 0.000161 roc_auc hand_till  0.990    10 0.000301 Preprocessor1_Model19
5 0.00386  roc_auc hand_till  0.990    10 0.000370 Preprocessor1_Model23
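
Rather than taking the single best penalty, we apply the one-standard-error rule: select_by_one_std_err() picks the most regularized (simplest) model whose ROC AUC is within one standard error of the numerically best one.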
chosen_auc <- 
  tune_rs %>%
  select_by_one_std_err(metric = "roc_auc", -penalty)

chosen_auc
# A tibble: 1 × 9
  penalty .metric .estimator  mean     n  std_err .config           .best .bound
    <dbl> <chr>   <chr>      <dbl> <int>    <dbl> <chr>             <dbl>  <dbl>
1 0.00174 roc_auc hand_till  0.991    10 0.000348 Preprocessor1_Mo… 0.991  0.991
final_multi_lasso <- 
  finalize_workflow(tune_wf, chosen_auc)

final_multi_lasso
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: multinom_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_tokenize()
• step_tokenfilter()
• step_tfidf()
• step_downsample()

── Model ───────────────────────────────────────────────────────────────────────
Multinomial Regression Model Specification (classification)

Main Arguments:
  penalty = 0.00174332882219999
  mixture = 1

Computational engine: glmnet 
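
We finish by fitting the finalized workflow to the full training data (here, the cleaned complaints). Note that calling tidy() on the extracted glmnet fit returns coefficients for the entire regularization path, which is why many step/lambda combinations appear below.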
doParallel::registerDoParallel()

fitted_multi_lasso <- fit(final_multi_lasso, 
                          complaint_clean)
fitted_multi_lasso |> 
  extract_fit_engine() |> 
  tidy() |>
  arrange(desc(estimate)) |> 
  print(n = 40)
# A tibble: 131,873 × 6
   class                       term              step estimate  lambda dev.ratio
   <chr>                       <chr>            <dbl>    <dbl>   <dbl>     <dbl>
 1 Credit card or prepaid card tfidf_complaint…   100     358. 2.51e-5     0.908
 2 Mortgage                    tfidf_complaint…   100     356. 2.51e-5     0.908
 3 Mortgage                    tfidf_complaint…    99     354. 2.75e-5     0.908
 4 Mortgage                    tfidf_complaint…    98     351. 3.02e-5     0.908
 5 Credit card or prepaid card tfidf_complaint…    99     349. 2.75e-5     0.908
 6 Mortgage                    tfidf_complaint…    97     349. 3.31e-5     0.908
 7 Mortgage                    tfidf_complaint…    96     346. 3.64e-5     0.907
 8 Mortgage                    tfidf_complaint…    95     344. 3.99e-5     0.907
 9 Mortgage                    tfidf_complaint…    94     341. 4.38e-5     0.907
10 Mortgage                    tfidf_complaint…    93     338. 4.81e-5     0.907
11 Mortgage                    tfidf_complaint…    92     335. 5.28e-5     0.907
12 Credit card or prepaid card tfidf_complaint…    98     334. 3.02e-5     0.908
13 Mortgage                    tfidf_complaint…    91     332. 5.79e-5     0.906
14 Mortgage                    tfidf_complaint…    90     329. 6.35e-5     0.906
15 Mortgage                    tfidf_complaint…    89     325. 6.97e-5     0.906
16 Credit card or prepaid card tfidf_complaint…    97     324. 3.31e-5     0.908
17 Mortgage                    tfidf_complaint…    88     322. 7.65e-5     0.905
18 Mortgage                    tfidf_complaint…    87     318. 8.40e-5     0.905
19 Mortgage                    tfidf_complaint…    86     315. 9.22e-5     0.904
20 Mortgage                    tfidf_complaint…    85     312. 1.01e-4     0.904
21 Credit card or prepaid card tfidf_complaint…    96     311. 3.64e-5     0.907
22 Mortgage                    tfidf_complaint…    84     308. 1.11e-4     0.903
23 Mortgage                    tfidf_complaint…    83     305. 1.22e-4     0.903
24 Mortgage                    tfidf_complaint…    82     301. 1.34e-4     0.902
25 Credit card or prepaid card tfidf_complaint…    95     301. 3.99e-5     0.907
26 Mortgage                    tfidf_complaint…    81     297. 1.47e-4     0.901
27 Mortgage                    tfidf_complaint…    80     293. 1.61e-4     0.900
28 Credit card or prepaid card tfidf_complaint…    94     292. 4.38e-5     0.907
29 Mortgage                    tfidf_complaint…    79     289. 1.77e-4     0.900
30 Mortgage                    tfidf_complaint…    78     285. 1.94e-4     0.899
31 Mortgage                    tfidf_complaint…    77     281. 2.13e-4     0.898
32 Credit card or prepaid card tfidf_complaint…    93     281. 4.81e-5     0.907
33 Mortgage                    tfidf_complaint…    76     278. 2.34e-4     0.896
34 Mortgage                    tfidf_complaint…    75     274. 2.57e-4     0.895
35 Credit card or prepaid card tfidf_complaint…    92     272. 5.28e-5     0.907
36 Mortgage                    tfidf_complaint…    74     271. 2.82e-4     0.894
37 Student loan                tfidf_complaint…   100     269. 2.51e-5     0.908
38 Student loan                tfidf_complaint…    98     267. 3.02e-5     0.908
39 Mortgage                    tfidf_complaint…    73     267. 3.09e-4     0.893
40 Student loan                tfidf_complaint…    99     267. 2.75e-5     0.908
# ℹ 131,833 more rows
fitted_multi_lasso |> 
  glance()
# A tibble: 1 × 3
  nulldev npasses  nobs
    <dbl>   <int> <int>
1 102464.    3850 36956
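
Note that nobs = 36956 = 4 × 9239: step_downsample() balanced the four classes to the size of the smallest one (Vehicle loan or lease) before fitting.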

Let’s make some predictions!

In the second part of the evaluation, we must make predictions on the test data. With the previously fitted model, we got 19 of the 20 quiz questions (95%) right.

df_test <-
  read_csv("./data/test.csv") |> 
  clean_names() |> 
  rename("complaint" = consumer_complaint_narrative)  
df_test <- 
  predict(fitted_multi_lasso, df_test) |> 
  bind_cols(df_test)
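
As a quick look at the output (a simple sanity check; the quiz itself supplies the ground truth), we can tabulate the predicted categories:

df_test |> 
  count(.pred_class)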

References

Hvitfeldt, Emil, and Julia Silge. 2021. Supervised Machine Learning for Text Analysis in R. https://doi.org/10.1201/9781003093459.
Silge, Julia, and David Robinson. 2019. Text Mining with R: A Tidy Approach. O’Reilly Media.