1 Reading and Cleaning the Data

library(tidyverse)
library(tidymodels)
library(here)
library(janitor)

train <- read_csv(here::here("data", "tidy_data", "data_complaints_train.csv"))

library(skimr)
#glimpse(train)
#library(visdat)
#vis_dat(train)
#vis_miss(train)

skim(train)
Table 1.1: Data summary
Name train
Number of rows 90975
Number of columns 6
_______________________
Column type frequency:
character 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Product 0 1 8 27 0 4 0
Consumer complaint narrative 0 1 12 32317 0 90240 0
Company 0 1 4 82 0 1497 0
State 0 1 2 36 0 62 0
ZIP code 0 1 4 5 0 5484 0
Submitted via 0 1 3 3 0 1 0

There is no missing data: all columns have a complete rate of 1. The first observations from skimming (using the skim() function from the skimr package) are the following:

  1. The names of the variables should be cleaned, in order to respect the tidy philosophy for variable names. This will be done using clean_names().

  2. The variable Submitted via has one unique value, hence it injects no information and shouldn't be used as a predictor. The variable ZIP code is redundant: for many observations the last two digits are masked, which removes the within-state geographical granularity, so effectively the only information it carries is the state, already contained in the State variable. Both Submitted via and ZIP code will be dropped.

  3. For simplicity, the variable consumer_complaint_narrative will be renamed to complaint_narrative.

train_tidy <- train |> 
    clean_names() |> 
    select(-c(submitted_via, zip_code)) |>
    rename(complaint_narrative = consumer_complaint_narrative) |>
    mutate(product = factor(product))

#train_tidy

2 Exploratory Data Analysis

For classifying the type of complaint (i.e. the variable product) based on the available potential predictors, the main interest will be the variable complaint_narrative. Since it contains the complaint text, it can be mined for words that are informative about the complaint category (product).

We will first explore the other potential predictors, company and state, in order to determine if they can be used in the model. This will be done by investigating whether each potential predictor injects any information with respect to the outcome variable.

We start by looking at the outcome variable. There is no missing data and we know it is a categorical variable with 4 levels. We will check whether there is an imbalance in the data.

2.1 The product outcome

In this section the outcome, i.e. the product variable, is explored. The goal is to check whether the outcome's levels are unbalanced, which we do by looking at the distribution of the product variable.

library(viridis)
par(mar = c(4, 4, .1, .1))
train_tidy |>
    count(product) |>
    mutate(product_freq = n / sum(n)) |>
    ggplot(aes(x = fct_reorder(product, product_freq), 
               y = product_freq)) +
    geom_col(aes(fill = n)) + 
    geom_text(aes(label = scales::percent(product_freq)), 
              vjust=-.5, size=4, fontface = "bold") +
    labs(x = "Product",
         y = "Product Frequency",
         fill = "Product Counts") +
    scale_fill_viridis(labels = scales::unit_format(unit = "k",
                                                     scale = 1e-3)) +
    theme_bw()

Figure 2.1: Percentage and counts of complaints over product levels. The counts are shown in the legend.

The data is unbalanced. However, the least represented level, Vehicle loan or lease, representing only 10.2% of the total observations, still has roughly 10k corresponding observations, hence the imbalance might not be an issue for the classification. Nevertheless, it can be addressed easily by including an additional step in the recipe and downsampling the data in order to ensure equal numbers across the (four) levels. The training set will be downsampled so that the model is trained on balanced data.
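
As a preview, here is a minimal sketch of such a balancing step (step_downsample() from the themis package), which will be added to the recipe in Section 3.3:

library(themis)
# Sketch only: downsample so that every product level keeps as many rows as
# the least represented level (under_ratio = 1 is the default)
recipe(product ~ ., data = train_tidy) |>
    step_downsample(product, under_ratio = 1)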

2.2 The company variable

In this section the company variable is explored. There are 90,975 available observations and 1497 companies. If most companies are associated with all (or multiple) types of products, then this variable cannot inject much information into the model, since it does not discriminate between companies with respect to their corresponding product. However, if there are companies that provide exclusively one product (i.e. one specific value of product), then a new variable (feature) can be created: it codes the corresponding product value for companies associated with exactly one product and uses another code for companies handling multiple products. This can be informative and improve the classification roc-auc, provided the number of observations corresponding to companies offering only one product is significant.

A new feature based on the company variable is created, company_profile, with five levels: four corresponding to the unique type of product a company is associated with, and a fifth indicating the company is associated with multiple products. We look at the distribution of this new variable and at the distribution of the complaints conditional on it.

product_levels <- unique(train_tidy$product)
#product_levels

company_by_product <- train_tidy |> 
    group_by(product) |> 
    distinct(company)
#company_by_product

credit <- company_by_product |> 
    filter(product == product_levels[1]) |>
    pull(company)

mortage <- company_by_product |>
    filter(product == product_levels[2]) |>
    pull(company)
                            
student <- company_by_product |>
    filter(product == product_levels[3]) |>
    pull(company)
                            
vehicle <- company_by_product |>
    filter(product == product_levels[4]) |>
    pull(company)

only_credit_companies <- list(credit, mortage, student, vehicle) |>
    reduce(setdiff)

only_mortage_companies <- list(mortage, credit, student, vehicle) |>
    reduce(setdiff)
 
only_student_companies <- list(student, credit, mortage, vehicle) |>
    reduce(setdiff)

only_vehicle_companies <- list(vehicle, credit, mortage, student) |>
    reduce(setdiff)

Getting the number of companies associated exclusively with one type (level) of product:

list(only_credit_companies, only_mortage_companies, 
     only_student_companies, only_vehicle_companies) |>
    map(length) |>
    unlist() |>
    sum()
## [1] 1293

Among the 1497 unique values of the company variable, 1293 are associated exclusively with one type of product (a unique value of the product variable), i.e. roughly 86% of the companies handle only one type of product. Hence the idea of using a new feature based on this variable in order to inject information and increase the classification roc-auc seems reasonable. However, this depends on the number of complaints (observations) in the data set that correspond to companies handling exclusively one product.
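
As a quick check (a sketch using the vectors computed above), the share of complaints filed against single-product companies can be computed directly:

# Share of complaints coming from companies handling exclusively one product
single_product_companies <- c(only_credit_companies, only_mortage_companies,
                              only_student_companies, only_vehicle_companies)
train_tidy |>
    summarise(single_product_share = mean(company %in% single_product_companies))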

train_tidy <- train_tidy |>
    mutate(company_profile = case_when(company %in% only_credit_companies ~ "credit",
                                   company %in% only_mortage_companies ~ "mortage",
                                   company %in% only_student_companies ~ "student",
                                   company %in% only_vehicle_companies ~ "vehicle",
                                   TRUE ~ "multiple") |>
           factor())

#train_tidy

We look at the distribution of the company profiles and at the distribution of the complaints over the five levels of company_profile.

par(mar = c(4, 4, .1, .1))
# Bar plot number of companies per company profile
train_tidy |>
    group_by(company_profile) |>
    summarise(number_of_companies = length(unique(company))) |>
    mutate(frequency = number_of_companies / sum(number_of_companies)) |>
    ggplot(aes(x = fct_reorder(company_profile, frequency),
               y = frequency)) +
    geom_col(aes(fill = number_of_companies)) +
    geom_text(aes(label = scales::percent(frequency)),
              vjust = -.5, size = 4, fontface = "bold") +
    labs(x = "Product Category",
         y = "Companies Frequency",
         fill = "Company Counts") +
    scale_fill_viridis() + 
    theme_bw()
    
# Bar plot number of complaints (observations) per company profile
train_tidy |> 
    count(company_profile) |>
    mutate(company_profile_freq = n / sum(n)) |>
    ggplot(aes(x = fct_reorder(company_profile, company_profile_freq),
               y = company_profile_freq)) +
    geom_col(aes(fill = n)) +
    geom_text(aes(label = scales::percent(company_profile_freq)),
              vjust = -.5, size = 4, fontface = "bold") +
    labs(x = "Companies profile",
         y = "Complaints Frequency",
         fill = "Complaints Counts") +
    scale_fill_viridis(labels = scales::unit_format(unit = "k",
                                                    scale = 1e-3)) +
    theme_bw()

Figure 2.2: Percentage of companies (left) and complaints (right) by company profile. The counts are shown in the corresponding legends.

The bar plots show that only 13.6% of the companies provide multiple types of products, but they are associated with more than 80% of the complaints. The new feature company_profile - created based on the company variable - might therefore inject some amount of information into the classification model and improve the roc-auc, hence:

  • the new feature company_profile will be kept as a model predictor.
  • the (original) variable company is dropped.
train_tidy <- train_tidy |> 
    select(-c(company))

2.3 The state variable

The state variable has 62 levels, corresponding to the abbreviations of the 50 states, together with DC (District of Columbia), PR (Puerto Rico), GU (Guam), VI (Virgin Islands), AS (American Samoa), FM (Federated States of Micronesia), MH (Marshall Islands), UNITED STATES MINOR OUTLYING ISLANDS, AE (armed forces in Europe, the Middle East, Africa, and Canada), AP (armed forces in the Pacific), AA (armed forces in the Americas excluding Canada) and None.

We first change the value UNITED STATES MINOR OUTLYING ISLANDS to OUT in the state variable.

train_tidy <- train_tidy |>
    mutate(state = case_match(state, "UNITED STATES MINOR OUTLYING ISLANDS" ~ "OUT",
                              .default = state))

We check if there is a visible variation of the frequencies of complaints corresponding to each of the four product levels within states, which would indicate that the state variable can be informative for the model. The frequencies are not relevant if the absolute number of complaints for a state is small. This is expected to be the case for the abbreviations corresponding to the territories outside the 50 states and DC, so first we check the number of complaints corresponding to these abbreviations.

outer_territories <- setdiff(unique(train_tidy$state), c(state.abb, "DC"))
train_tidy |>
    filter(state %in% outer_territories) |>
    pull(state) |>
    table() |>
    t() |>
    knitr::kable(caption = "Number of observations per outer territories")
Table 2.1: Number of observations per outer territories
AA AE AP AS FM GU MH None OUT PR VI
6 45 30 2 8 11 1 349 15 192 11

Setting the threshold at 50, we remove the values AA, AE, AP, AS, FM, GU, MH, OUT and VI.

train_tidy |>
    filter(!state %in% c("AA", "AE", "AP", "AS", "FM", "GU", "MH", "OUT", "VI")) |>
    group_by(state, product) |>
    summarise(counts = n()) |>
    mutate(freq = counts/sum(counts)) |>
    ungroup() |>
    complete(state, product, fill = list(counts = 0, freq = 0)) |>
    ggplot(aes(x = state, y = freq, color = product)) +
    geom_point()+
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"),
          legend.position="top",
          axis.text.x = element_text(angle = 60, vjust = 0.5, hjust=1)) +
    labs(x = "States",
         y = "Frequency")

Figure 2.3: Frequency of outcome (complaint) levels per state (only states/regions with more than 50 complaints).

The frequencies between the four levels of complaints seem to be fairly constant across states, hence

  • the state variable will be dropped.
train_tidy <- train_tidy |> 
    select(-state)

2.4 The complaint_narrative variable

In this section the complaint_narrative variable is explored. This variable is clearly the most informative, since text mining can be used to search for words that are predictive of each of the four levels of the product variable, and informative new predictors can be created from it.

We check the first two observations corresponding to the complaint_narrative variable.

train_tidy |>
    head(2) |>
     pull(complaint_narrative)
## [1] "I initially in writing to Chase Bank in late XXXX about a charge that was unauthorized on XXXX from XXXX in the amount of {$790.00} after many letters and complaints they still did not do anything yet in XXXX they closed my original account and reissued a new account and sent me a letter indicating they moved this charge to fraud and it would be immediately credited however they have failed to do that and almost 8 months later it is still on my statement and I'm getting charged interest. Even after this I still use my card and PAY for my charges and have never been late however I will not pay for a charge I didn't make and refuse and Chase refuses to do anything even after many attempts and letters they write stating so. I feel this is last step before I begin a case in XXXX XXXX XXXX XXXX against them"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [2] "My ex husband and myself had a mobile home ( Home Mortgage ) with XXXX XXXX back in XX/XX/XXXX. I went through a horrible divorce as my husband was using drugs and failing to pay the bills which led to my having to divorce him. The Mortgage was in both of our names. We only had the Mortgage for one year and they got the trailer bc of being unable to pay debt. I was not working at the time as I had just given birth to my first child in XX/XX/XXXX ( she is now XXXX of age ) as this happened years ago. \n\nNow after 21 years, XXXX XXXX XXXX XXXX  filled a XXXX XXXX to the IRS and the IRS alerted me which caused me an adverse affect on my IRS taxes. \nI received a l letter from the IRS XX/XX/XXXX which stated that due to DITECH Mortgage sending a XXXX with taxable income of {$17000.00}, this could cause me to have to owe the IRS {$2900.00}. IRS Told me to contact DITECH Financial to dispute this and request that DITECH Mortgage rescind the XXXX in order ( so it is not XXXX, XXXX taxable income ). \n\nI contacted DITECH at XXXX spoke with a \" XXXX '' and I TOLD HER THAT I did not want to be recorded but she did it anyways. \nI told her that I never settled on an amount of XXXX  XXXX of a debt from a mortgage that I had back in XX/XX/XXXX with DITECH nor did I ever receive a XXXX from them. I asked them to rescind the XXXX ( taxable income of {$17000.00} ) as this is a debt from \" 20 Years ago '' in which I should not be taxed for taxable income of XXXX, XXXX on a debt from 20 years ago. However, XXXX refused to rescind ( reverse ) the XXXX. \nNOTE, I have suffered with my credit from this debt for over ten years and it was removed from my debt bc it has been over the 7 to 10 year period that the debt is lawfully removed from my credit. They can't go back 20 years later AND IN XX/XX/XXXX send the IRS and myself a XXXX form to tax me {$17000.00} for a debt from over 20 years ago that WAS charged off my credit report back in the year XX/XX/XXXX!!! this is very unfair and not allowed. \nALSO< I disputing the fact that they kept putting this on and off my credit for over 20 years now and it has negatively affected my credit for way over the 7 year period. \nPOINT BEING< IF THEY HAD PLANS OF TAXING ME AT {$17000.00} then they should have taxed me 20 yrs ago ( BEFORE THE 7 TO 10 YEAR PERIOD PASSED WHICH IS NOW WAY OVER ). If they were going to send a XXXX taxable income, they should have done this YEARS AGO NOT IN XX/XX/XXXX. I AM REQUESTING A COMPLAINT SUBMITTED AS THEY HAVE UNFAIRLY AFFECTED MY CREDIT PAST THE 20 YEAR PERIOD AND IT IS VIOLATING MY RIGHTS TO BE TAXED FOR SOMETHING IN XX/XX/XXXX ON A DEBT FROM THE YEAR XX/XX/XXXX"

2.4.1 Cleaning the text

The placeholder "XXXX" appears multiple times in most of the complaints; it will be removed. There are also several patterns that can be removed without losing information: dates, currency amounts, digits, underscores and line breaks. Everything is transformed to lower case.

train_tidy <- train_tidy |>
    # remove the censoring placeholders ("XXXX", "xx", ...)
    mutate(complaint_narrative = str_remove_all(complaint_narrative,
                                                 "X(X+)|x(x+)")) |>
    # remove dates of the form XX/XX/XXXX
    mutate(complaint_narrative = str_remove_all(complaint_narrative,
                                                 "..\\/..\\/....")) |>
    # remove currency amounts of the form {$790.00}
    mutate(complaint_narrative = str_remove_all(complaint_narrative,
                                                 "\\{\\$[0-9\\.]*\\}")) |>
    # remove line breaks, possibly separated by underscores
    mutate(complaint_narrative = str_remove_all(complaint_narrative,
                                                 "\n_*\n")) |>
    # remove runs of underscores
    mutate(complaint_narrative = str_remove_all(complaint_narrative,
                                                 "(_+\\s?)+")) |>
    # remove digits
    mutate(complaint_narrative = str_remove_all(complaint_narrative,
                                                 "[0-9]+")) |>
    # transform everything to lower case
    mutate(complaint_narrative = tolower(complaint_narrative))
train_tidy |>
    head(2) |>
     pull(complaint_narrative)
## [1] "i initially in writing to chase bank in late  about a charge that was unauthorized on  from  in the amount of  after many letters and complaints they still did not do anything yet in  they closed my original account and reissued a new account and sent me a letter indicating they moved this charge to fraud and it would be immediately credited however they have failed to do that and almost  months later it is still on my statement and i'm getting charged interest. even after this i still use my card and pay for my charges and have never been late however i will not pay for a charge i didn't make and refuse and chase refuses to do anything even after many attempts and letters they write stating so. i feel this is last step before i begin a case in     against them"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [2] "my ex husband and myself had a mobile home ( home mortgage ) with   back in //. i went through a horrible divorce as my husband was using drugs and failing to pay the bills which led to my having to divorce him. the mortgage was in both of our names. we only had the mortgage for one year and they got the trailer bc of being unable to pay debt. i was not working at the time as i had just given birth to my first child in // ( she is now  of age ) as this happened years ago. now after  years,      filled a   to the irs and the irs alerted me which caused me an adverse affect on my irs taxes. \ni received a l letter from the irs // which stated that due to ditech mortgage sending a  with taxable income of , this could cause me to have to owe the irs . irs told me to contact ditech financial to dispute this and request that ditech mortgage rescind the  in order ( so it is not ,  taxable income ). i contacted ditech at  spoke with a \"  '' and i told her that i did not want to be recorded but she did it anyways. \ni told her that i never settled on an amount of    of a debt from a mortgage that i had back in // with ditech nor did i ever receive a  from them. i asked them to rescind the  ( taxable income of  ) as this is a debt from \"  years ago '' in which i should not be taxed for taxable income of ,  on a debt from  years ago. however,  refused to rescind ( reverse ) the . \nnote, i have suffered with my credit from this debt for over ten years and it was removed from my debt bc it has been over the  to  year period that the debt is lawfully removed from my credit. they can't go back  years later and in // send the irs and myself a  form to tax me  for a debt from over  years ago that was charged off my credit report back in the year //!!! this is very unfair and not allowed. \nalso< i disputing the fact that they kept putting this on and off my credit for over  years now and it has negatively affected my credit for way over the  year period. \npoint being< if they had plans of taxing me at  then they should have taxed me  yrs ago ( before the  to  year period passed which is now way over ). if they were going to send a  taxable income, they should have done this years ago not in //. i am requesting a complaint submitted as they have unfairly affected my credit past the  year period and it is violating my rights to be taxed for something in // on a debt from the year //"

We tokenize the data, count and sort the words, and remove the stop words.

library(tidytext)
data(stop_words)
train_tidy_tok <- train_tidy |>
    unnest_tokens(word,
                  complaint_narrative) |>
    count(product, word,
          sort = TRUE) |>
    anti_join(stop_words,
              by = join_by(word))

train_tidy_tok |>
    head(5) |>
    knitr::kable(caption = "Tokenized words (word) in the training set sorted by their counts (n) together with the corresponding level in the outcome variable.")
Table 2.2: Tokenized words (word) in the training set sorted by their counts (n) together with the corresponding level in the outcome variable.
product word n
Credit card or prepaid card card 91717
Credit card or prepaid card credit 86691
Mortgage mortgage 72709
Mortgage loan 72184
Credit card or prepaid card account 69050

The distribution of the 20 most frequent words contained in the complaints, for each product level:

train_tidy_tok |>
    group_by(product) |>
    slice_max(n, n = 20) |>
    ggplot(aes(x = reorder_within(word, n, product),
               y = n,
               fill = product)) +
    geom_col(show.legend = FALSE) +
    scale_x_reordered() +
    coord_flip() +
    facet_wrap(. ~ product, scales = "free_y") +
    theme_bw() +
    labs(x = "", y = "")

Figure 2.4: Words in complaints sorted by their frequency, conditional on the product level. There is a visible overlap between the levels in the outcome variable.

We notice a significant overlap between the most frequent words across the four product levels. For instance, among the 20 most frequent words in each of the four product levels, 8 words are common to all.

train_tidy_tok |>
    group_by(product) |>
    slice_max(n, n = 20) |>
    summarise(frequent_words = unique(list(word))) |>
    pull(frequent_words) |>
    reduce(intersect) |>
    t() |>
    knitr::kable(caption = "Common words between the most 20 frequent words in the complaint four levels ")
Table 2.3: Common words between the most 20 frequent words in the complaint four levels
account payment told called received time due pay

And among the 50 most frequent words in each of the four product levels, 23 words are common to all.

train_tidy_tok |>
    group_by(product) |>
    slice_max(n, n = 50) |>
    summarise(frequent_words = unique(list(word))) |>
    pull(frequent_words) |>
    reduce(intersect) |>
    length()
## [1] 23

Creating features using word frequency as the measure of importance won't inject much useful information, since there is a significant overlap between the classes. Instead, we use the term frequency–inverse document frequency (tf-idf) as the measure of importance of words in the complaints.
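
For reference, treating each product level as one "document" (as bind_tf_idf() from tidytext does below), the measure has the standard form

$$\text{tf-idf}(w, d) = \underbrace{\frac{n_{w,d}}{\sum_{w'} n_{w',d}}}_{\text{tf}(w,\,d)} \times \underbrace{\ln \frac{N}{N_w}}_{\text{idf}(w)},$$

where $n_{w,d}$ is the count of word $w$ in document $d$, $N = 4$ is the number of documents (product levels) and $N_w$ is the number of documents containing $w$. For instance, a word appearing in two of the four product levels gets $\mathrm{idf} = \ln(4/2) \approx 0.693$, matching the idf column of Table 2.4 below.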

train_tfidf <- train_tidy_tok |>
    bind_tf_idf(term = word,
                document = product,
                n = n)
train_tfidf |>
    arrange(desc(tf_idf)) |>
    head(10) |>
    knitr::kable(caption = "Sorted tokenized words in the training set")
Table 2.4: Sorted tokenized words in the training set
product word n tf idf tf_idf
Student loan navient 13432 0.0133355 0.6931472 0.0092435
Student loan fedloan 1769 0.0017563 1.3862944 0.0024347
Credit card or prepaid card amex 4401 0.0014585 1.3862944 0.0020219
Student loan pslf 2303 0.0022865 0.6931472 0.0015848
Student loan nelnet 2258 0.0022418 0.6931472 0.0015539
Mortgage phh 3767 0.0011085 1.3862944 0.0015367
Vehicle loan or lease gm 1608 0.0021144 0.6931472 0.0014656
Mortgage shellpoint 3552 0.0010452 1.3862944 0.0014490
Mortgage loancare 3359 0.0009885 1.3862944 0.0013703
Mortgage sps 3314 0.0009752 1.3862944 0.0013519
train_tfidf |>
    group_by(product) |>
    slice_max(tf_idf, n = 20) |>
    ggplot(aes(x = reorder_within(word, tf_idf, product),
               y = tf_idf,
               fill = product)) +
    geom_col(show.legend = FALSE) +
    scale_x_reordered() +
    coord_flip() +
    facet_wrap(. ~ product, scales = "free_y") +
    theme_bw() +
    labs(x = "", y = "")

Figure 2.5: Words in complaints sorted by tf-idf, conditional on the product level, with no overlap between outcome levels.

3 Model selection

In this section multiple classification models are tested: multinomial logistic regression models (ridge, lasso, elastic net) and random forests.

For all model classes considered (ridge, lasso, elastic net, random forests), the parameters are tuned using a grid search. The number of candidate parameters is set to a constant value for tuning all considered models.

number_of_candidate_param <- 20

For each model class (e.g. ridge, lasso, random forests) the model selection is done using k-fold cross-validation. The number of partitions (folds) is set to a constant value for model selection for all considered models.

number_of_partitions <- 20

In order to compare the considered models:

  1. For each class of models, the same number of candidate parameters is evaluated, with the evaluation done using k-fold cross-validation.
  2. The model with the best mean roc-auc over the cross-validation folds is retained from that class, with the final roc-auc estimate computed on the held-out split (i.e. the 1/3 of the available training data that was not used for fitting).
  3. The best models of each class are compared in terms of this roc-auc estimate.

In the first subsection the test data is transformed using the same procedure as the training data (i.e. dropping all the columns that are not predictive and won't be included in the model). The following subsections present the data split and the recipe used by all the considered models. The next subsections are dedicated to the different models that are tested. In the last section, all the tested models are presented together with their roc-auc scores, corresponding to the parameters tuned via grid search and selected via cross-validation.

3.1 The test data

In the exploratory data analysis we established that:

  1. the outcome product has to be transformed into a factor and the recipe should include a downsampling step in order to account for the imbalance in the data, see Figure 2.1.

  2. the submitted_via variable can't inject any information (having a unique value), hence it is dropped.

  3. the zip_code variable does not inject any information since it’s containing effectively the same information as the state variable hence it is dropped. (This is due to the fact that for most observations the zip_code has the last two digits unavailable.)

  4. the state variable doesn't seem predictive of the outcome levels, since each outcome level has a fairly constant frequency across states (i.e. each outcome level is roughly uniformly distributed across the states), see Figure 2.3. It is dropped as a potential predictor.

  5. the company variable is dropped.

  6. a new (categorical) feature derived from company, company_profile, with five levels, is retained. How the complaints are distributed over the company_profile levels can be seen in Figure 2.2.

  7. the complaint_narrative variable is retained and used as the main predictor, with features derived from it via the term frequency–inverse document frequency measure.

The number of columns in the training set, after dropping the columns that are not predictive, is 3:

  1. product (the outcome),
  2. complaint_narrative (predictor via tf-idf),
  3. company_profile (predictor via one-hot encoding)
train_tidy |>
    head(1) |>
    knitr::kable(caption = "Train set")
Table 3.1: Train set
product complaint_narrative company_profile
Credit card or prepaid card i initially in writing to chase bank in late about a charge that was unauthorized on from in the amount of after many letters and complaints they still did not do anything yet in they closed my original account and reissued a new account and sent me a letter indicating they moved this charge to fraud and it would be immediately credited however they have failed to do that and almost months later it is still on my statement and i’m getting charged interest. even after this i still use my card and pay for my charges and have never been late however i will not pay for a charge i didn’t make and refuse and chase refuses to do anything even after many attempts and letters they write stating so. i feel this is last step before i begin a case in against them multiple

Both predictors are used in the model indirectly, i.e. in the recipe new features will be constructed from complaint_narrative using the tf-idf and from company_profile using one-hot encoding.
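
For illustration only, a minimal sketch of what the one-hot encoding of company_profile produces (the actual encoding is done inside the recipe of Section 3.3):

# Sketch: one-hot encoding of company_profile yields one indicator column per level
recipe(~ company_profile, data = train_tidy) |>
    step_dummy(company_profile, one_hot = TRUE) |>
    prep() |>
    bake(new_data = NULL) |>
    names()
# expected columns: company_profile_credit, company_profile_mortage,
# company_profile_multiple, company_profile_student, company_profile_vehicle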

The same transformations (dropping submitted_via, zip_code, company and state, together with the test-specific problem_id column, and adding the company_profile feature) are applied to the testing set.

test <- read_csv(here::here("data", "tidy_data", "data_complaints_test.csv"))

#head(test_set)

test_tidy <- test |>
    clean_names() |>
    select(-c(problem_id, submitted_via, zip_code)) |>
    rename(complaint_narrative = consumer_complaint_narrative)

test_tidy <- test_tidy |>
    mutate(company_profile = case_when(company %in% only_credit_companies ~ "credit",
                                   company %in% only_mortage_companies ~ "mortage",
                                   company %in% only_student_companies ~ "student",
                                   company %in% only_vehicle_companies ~ "vehicle",
                                   TRUE ~ "multiple") |>
           factor())

test_tidy <- test_tidy |>
    select(-c(company, state))

test_tidy |>
    head(1) |>
    knitr::kable(caption = "Test set")
Table 3.2: Test set
complaint_narrative company_profile
I have multiple lateness/missed payments on my credit report due to ECMC not properly contacting me to inform me when payments were due. I never received any statements in regards to my loan. It was also a period in time when my loan was deferred so there should not be any late/ missed payments that is being reporting on my credit file. multiple

3.2 Splitting the data

We split the data via initial_split(), extract the training and testing data, and prepare data_vfold for the cross-validation, using number_of_partitions folds.

set.seed(42)
data_split <- initial_split(train_tidy, prop = 2/3)
data_split
## <Training/Testing/Total>
## <60650/30325/90975>
data_train <- training(data_split)
data_test <- testing(data_split)

data_vfold <- rsample::vfold_cv(data = data_train, 
                                v = number_of_partitions)

3.3 Making the recipe

As mentioned in the EDA section, two versions of the models can be considered in terms of the predictors used:

  1. a first model using the complaint_narrative variable as the unique predictor.
  2. a second model additionally using the company_profile variable.

The recipe below is the two-predictor version, which is used by all the models that follow. The complaint_narrative variable is first tokenized (using step_tokenize from textrecipes), the stop words are removed and the tokens are stemmed; the most frequent tokens are then used to create new features via the term frequency–inverse document frequency.

library(textrecipes)
library(themis)

# The two-predictor recipe, using both complaint_narrative and company_profile
rec <- data_train |>
    recipe(product ~ .) |>
    # one-hot encoding for the factor variable
    step_dummy(company_profile,
               one_hot = TRUE) |>
    # tokenization
    step_tokenize(complaint_narrative) |>
    # removing the stop words
    step_stopwords(complaint_narrative) |>
    # stemming (chopping off the ends of words)
    step_stem(complaint_narrative) |>
    # keep only the 300 most frequent tokens
    step_tokenfilter(complaint_narrative,
                     max_tokens = 300) |>
    # term frequency–inverse document frequency
    step_tfidf(complaint_narrative) |>
    # downsampling the outcome in order to address the imbalance in levels
    step_downsample(product)
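
As a sanity check (a sketch, not part of the modelling pipeline), the recipe can be estimated on the training data and the resulting features inspected:

rec |>
    # estimate the recipe on the training data
    prep() |>
    # apply it to the same data and return the engineered features
    bake(new_data = NULL) |>
    glimpse()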

3.4 (Multinomial) logistic regression models

In this section the classification is done using multinomial regression. We will consider multiple options: ridge regression, lasso and elastic net.

The features used are the ones constructed by the recipe above: the tf-idf features derived from complaint_narrative and the one-hot encoded company_profile. The recipe also balances the training data via downsampling, which addresses the imbalance discussed in the EDA section.

3.4.1 Lasso: model, workflow and accuracy & roc-auc

In this section the lasso model is considered. The penalty term (lambda) is tuned using a grid search with number_of_candidate_param candidate parameters (i.e. grid = number_of_candidate_param) over number_of_partitions-fold cross-validation (i.e. v = number_of_partitions in vfold_cv).
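
For reference, glmnet fits the multinomial model with the standard elastic-net penalty, where penalty corresponds to $\lambda$ and mixture to $\alpha$ in the parsnip specification:

$$\lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right],$$

so that mixture = 1 gives the lasso, mixture = 0 gives ridge regression, and intermediate values give the elastic net of Section 3.4.3.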

Creation of the model:

library(glmnet)
library(doParallel)

# Creating the lasso model with tuning for the regularization parameter
lasso_model_tune <- multinom_reg(mode = "classification",
                                 engine = "glmnet",
                                 penalty = tune(), # regularization term
                                 mixture = 1 # LASSO
                                 )

Workflow, tune, cross-validation, metrics, best roc-auc:

# Model simulations
# -----------------

# Creating the workflow: lasso model and the recipe above
lasso_wflow <- workflow() |>
    workflows::add_recipe(rec) |>
    workflows::add_model(lasso_model_tune)

# Parallel computing for the grid search procedure
doParallel::registerDoParallel(parallel::detectCores())
# setting a seed for reproducibility
set.seed(42)

# The tuning results for the number_of_candidate_param candidate parameters
lasso_params <- tune_grid(lasso_wflow,
                          resamples = data_vfold,
                          grid = number_of_candidate_param
                          )
# Showing the best results
lasso_params |>
    show_best(metric = "roc_auc",
              n = 5) |>
    knitr::kable(caption = "Metrics for canditates one predictor lasso model")
Table 3.3: Metrics for canditates one predictor lasso model
penalty .metric .estimator mean n std_err .config
0.0000866 roc_auc hand_till 0.9896293 20 0.0003081 Preprocessor1_Model12
0.0002082 roc_auc hand_till 0.9896286 20 0.0003068 Preprocessor1_Model13
0.0000000 roc_auc hand_till 0.9896207 20 0.0003053 Preprocessor1_Model01
0.0000000 roc_auc hand_till 0.9896207 20 0.0003053 Preprocessor1_Model02
0.0000000 roc_auc hand_till 0.9896207 20 0.0003053 Preprocessor1_Model03
# Get the best result
lasso_best_param <- lasso_params |>
    select_best(metric = "roc_auc")

# Finalize the workflow with the best corresponding parameter
lasso_wflow_tuned <- lasso_wflow |>
    tune::finalize_workflow(lasso_best_param)

# Fitting the tuned (best model)
lasso_tuned_fitted <- lasso_wflow_tuned |>
    last_fit(split = data_split) 

# Collecting the metrics of the best tuned model, fitted on the training split and evaluated on the held-out split
lasso_tuned_metrics <- lasso_tuned_fitted |>
    collect_metrics()

# Show the metrics as a kable
lasso_tuned_metrics |>
    knitr::kable(caption = "Metrics for one predictor lasso tunned model")
Table 3.3: Metrics for one predictor lasso tunned model
.metric .estimator .estimate .config
accuracy multiclass 0.9407420 Preprocessor1_Model1
roc_auc hand_till 0.9904114 Preprocessor1_Model1
brier_class multiclass 0.0438535 Preprocessor1_Model1
all_tested_models_metrics <- lasso_tuned_metrics |>
    filter(.metric == "roc_auc") |>
    select(-.config) |>
    mutate(model = "lasso",
           name = "one_pred_lasso_tuned_fitted") |>
    relocate(model)

3.4.2 Ridge Regression: model, workflow & roc-auc

In this section the ridge regression model is considered. The penalty term (lambda) is tuned using a grid search with number_of_candidate_param candidate parameters (i.e. grid = number_of_candidate_param) over number_of_partitions-fold cross-validation (i.e. v = number_of_partitions in vfold_cv).

Creation of the model:

# Creating the ridge regression model with tuning for the regularization parameter
ridge_model_tune <- multinom_reg(mode = "classification",
                                 engine = "glmnet",
                                 penalty = tune(), # regularization term
                                 mixture = 0 # Ridge Regression
                                 )

Workflow, tune, cross-validation, metrics, best roc-auc:

# Model simulations
# -----------------

# Creating the workflow: ridge model and the recipe above
ridge_wflow <- workflow() |>
    workflows::add_recipe(rec) |>
    workflows::add_model(ridge_model_tune)

# Parallel computing for the grid search procedure
doParallel::registerDoParallel(parallel::detectCores())
# setting a seed for reproducibility
set.seed(42)

# The tuning results for the number_of_candidate_param candidate parameters
ridge_params <- tune_grid(ridge_wflow,
                          resamples = data_vfold,
                          grid = number_of_candidate_param
                          )

# Showing the best results
ridge_params |>
    show_best(metric = "roc_auc",
              n = 5) |>
    knitr::kable(caption = "Metrics for canditates one predictor ridge model")
Table 3.4: Metrics for canditates one predictor ridge model
penalty .metric .estimator mean n std_err .config
0 roc_auc hand_till 0.9847506 20 0.00045 Preprocessor1_Model01
0 roc_auc hand_till 0.9847506 20 0.00045 Preprocessor1_Model02
0 roc_auc hand_till 0.9847506 20 0.00045 Preprocessor1_Model03
0 roc_auc hand_till 0.9847506 20 0.00045 Preprocessor1_Model04
0 roc_auc hand_till 0.9847506 20 0.00045 Preprocessor1_Model05
# Get the best result
ridge_best_param <- ridge_params |>
    select_best(metric = "roc_auc")

# Finalize the workflow with the best corresponding parameter
ridge_wflow_tuned <- ridge_wflow |>
    tune::finalize_workflow(ridge_best_param)

# Fitting the tuned (best model)
ridge_tuned_fitted <- ridge_wflow_tuned |>
  last_fit(split = data_split)

# Collecting the metrics of the best tuned model, fitted on the training split and evaluated on the held-out split
ridge_tuned_metrics <- ridge_tuned_fitted |>
    collect_metrics() 

# Show the metrics as a kable
ridge_tuned_metrics |>
    knitr::kable(caption = "Metrics for one predictor ridge tunned model")
Table 3.4: Metrics for one predictor ridge tunned model
.metric .estimator .estimate .config
accuracy multiclass 0.9248475 Preprocessor1_Model1
roc_auc hand_till 0.9849885 Preprocessor1_Model1
brier_class multiclass 0.0635453 Preprocessor1_Model1
all_tested_models_metrics <- all_tested_models_metrics |>
    add_row(ridge_tuned_metrics |>
            filter(.metric == "roc_auc") |>
            select(-.config) |>
            mutate(model = "ridge",
                   name = "ridge_tuned_fitted")|>
            relocate(model))

3.4.3 Elastic Net Regression: model, workflow and accuracy & roc-auc

In this section the elastic net regression model is considered. The penalty term (lambda) and the mixture term are tuned using a grid search with number_of_candidate_param candidate parameters (i.e. grid = number_of_candidate_param) over number_of_partitions-fold cross-validation (i.e. v = number_of_partitions in vfold_cv).

Creation of the model:

library(glmnet)
library(doParallel)

# Creating the elastic net model with tuning for the regularization and mixture parameters
elastic_net_model_tune <- multinom_reg(mode = "classification",
                                       engine = "glmnet",
                                       penalty = tune(), # regularization term
                                       mixture = tune() # Elastic Net
                                       )

Workflow, tune, cross-validation, metrics, best roc-auc:

# Model simulations
# -----------------

# Creating the workflow: elastic net model and the recipe above
elastic_net_wflow <- workflow() |>
    workflows::add_recipe(rec) |>
    workflows::add_model(elastic_net_model_tune)

# Parallel computing for the grid search procedure
doParallel::registerDoParallel(parallel::detectCores())
# setting a seed for reproducibility
set.seed(42)

# The tuning results for the number_of_candidate_param candidate parameters
elastic_net_params <- tune_grid(elastic_net_wflow,
                                resamples = data_vfold,
                                grid = number_of_candidate_param
                                )

# Showing the best results
elastic_net_params |>
    show_best(metric = "roc_auc",
              n = 5) |>
    knitr::kable(caption = "Metrics for canditates one predictor elastic-net model")
Table 3.5: Metrics for canditates one predictor elastic-net model
penalty mixture .metric .estimator mean n std_err .config
0.0e+00 0.9732171 roc_auc hand_till 0.9896174 20 0.0003056 Preprocessor1_Model20
5.8e-06 0.9481793 roc_auc hand_till 0.9896153 20 0.0003057 Preprocessor1_Model19
1.0e-07 0.8470426 roc_auc hand_till 0.9896034 20 0.0003073 Preprocessor1_Model17
1.0e-06 0.7894657 roc_auc hand_till 0.9895972 20 0.0003076 Preprocessor1_Model16
0.0e+00 0.7249143 roc_auc hand_till 0.9895895 20 0.0003080 Preprocessor1_Model15
# Get the best result
elastic_net_best_param <- elastic_net_params |>
     select_best(metric = "roc_auc")

# Finalize the workflow with the best corresponding parameter
elastic_net_wflow_tuned <- elastic_net_wflow |>
    tune::finalize_workflow(elastic_net_best_param)

# Fitting the tuned (best model)
elastic_net_tuned_fitted <- elastic_net_wflow_tuned |>
  last_fit(split = data_split)

# Collecting the metrics of the best tuned model, fitted on the training split and evaluated on the held-out split
elastic_net_tuned_metrics <- elastic_net_tuned_fitted |>
    collect_metrics() 

# Show the metrics as a kable
elastic_net_tuned_metrics |>
    knitr::kable(caption = "Metrics for one predictor elasting-net tunned model")
Table 3.5: Metrics for one predictor elasting-net tunned model
.metric .estimator .estimate .config
accuracy multiclass 0.9405771 Preprocessor1_Model1
roc_auc hand_till 0.9904704 Preprocessor1_Model1
brier_class multiclass 0.0438204 Preprocessor1_Model1
all_tested_models_metrics <- all_tested_models_metrics |>
    add_row(elastic_net_tuned_metrics |>
            filter(.metric == "roc_auc") |>
            select(-.config) |>
            mutate(model = "elastic-net",
                   name = "elastic_net_tuned_fitted")|>
            relocate(model))

3.5 Random Forest: model, workflow and accuracy & roc-auc

In this section the random forest model is considered. The number of predictors randomly sampled at each split (mtry) and the minimum number of data points in a node required for the node to be split further (min_n) are tuned using a grid search with number_of_candidate_param candidate parameters (i.e. grid = number_of_candidate_param) over number_of_partitions-fold cross-validation (i.e. v = number_of_partitions in vfold_cv). The number of trees in the ensemble is fixed at 50.

Creation of the model:

library(ranger)
library(doParallel)

# Creating the random forest model with tuning for mtry and min_n
random_forest_model_tune <- rand_forest(mode = "classification",
                              engine = "ranger",
                              mtry = tune(),
                              min_n = tune(),
                              trees = 50)

Workflow, tune, cross-validation, metrics, best roc-auc:

# Model simulations
# -----------------

# Creating the workflow: random forest model and the recipe above
random_forest_wflow <- workflow() |>
    workflows::add_recipe(rec) |>
    workflows::add_model(random_forest_model_tune)

# Parallel computing for the grid search procedure
doParallel::registerDoParallel(parallel::detectCores())
# setting a seed for reproducibility
set.seed(42)

# The tuning results for the number_of_candidate_param candidate parameters
random_forest_params <- tune_grid(random_forest_wflow,
                                  resamples = data_vfold,
                                  grid = number_of_candidate_param
                                  )
# Showing the best results
random_forest_params |>
    show_best(metric = "roc_auc",
              n = 5) |>
    knitr::kable(caption = "Metrics for canditates one predictor random forest model")
Table 3.6: Metrics for canditates one predictor random forest model
mtry min_n .metric .estimator mean n std_err .config
36 13 roc_auc hand_till 0.9892384 20 0.0003555 Preprocessor1_Model11
24 5 roc_auc hand_till 0.9890225 20 0.0003567 Preprocessor1_Model19
55 17 roc_auc hand_till 0.9888831 20 0.0003399 Preprocessor1_Model03
71 39 roc_auc hand_till 0.9887300 20 0.0003965 Preprocessor1_Model12
80 30 roc_auc hand_till 0.9884943 20 0.0004114 Preprocessor1_Model01
# Get the best result
random_forest_best_param <- random_forest_params |>
     select_best(metric = "roc_auc")

# Finalize the workflow with the best corresponding parameter
random_forest_wflow_tuned <- random_forest_wflow |>
    tune::finalize_workflow(random_forest_best_param)

# Fitting the tuned (best model)
random_forest_tuned_fitted <- random_forest_wflow_tuned |>
  last_fit(split = data_split)

# Collecting the metrics of the best tuned model, fitted on the training split and evaluated on the held-out split
random_forest_tuned_metrics <- random_forest_tuned_fitted |>
    collect_metrics() 

# Show the metrics as a kable
random_forest_tuned_metrics |>
    knitr::kable(caption = "Metrics for one predictor random forests tunned model")
Table 3.6: Metrics for one predictor random forests tunned model
.metric .estimator .estimate .config
accuracy multiclass 0.9399835 Preprocessor1_Model1
roc_auc hand_till 0.9895933 Preprocessor1_Model1
brier_class multiclass 0.0462004 Preprocessor1_Model1
all_tested_models_metrics <- all_tested_models_metrics |>
    add_row(random_forest_tuned_metrics |>
            filter(.metric == "roc_auc") |>
            select(-.config) |>
            mutate(model = "random forests",
                   name = "random_forest_tuned_fitted") |>
            relocate(model))

4 Roc-auc for each model

all_tested_models_metrics |>
    knitr::kable(caption = "Roc-auc for tunned (tested) models ")
Table 4.1: Roc-auc for tunned (tested) models
model .metric .estimator .estimate name
lasso roc_auc hand_till 0.9904114 lasso_tuned_fitted
ridge roc_auc hand_till 0.9849885 ridge_tuned_fitted
elastic-net roc_auc hand_till 0.9904704 elastic_net_tuned_fitted
random forests roc_auc hand_till 0.9895933 random_forest_tuned_fitted

5 Confusion matrix for model with highest roc-auc

library(magrittr)
# Retrieve the name of the fitted object with the highest roc-auc
best_model <- all_tested_models_metrics |> 
    pull(name) |>
    extract2(which.max(all_tested_models_metrics$.estimate))

# Confusion matrix of the best model on the held-out split
eval(as.name(best_model)) |>
    collect_predictions() |>
    conf_mat(truth = product, estimate = .pred_class) |>
    autoplot(type = "heatmap") +
    scale_fill_gradient(low = "beige", high = "cadetblue")

Figure 5.1: Confusion matrix for the elastic net model (the tuned model with the highest roc-auc)

6 Predictions

pred_lasso <- lasso_tuned_fitted |>
     extract_workflow() |>
     predict(test_tidy) |>
     pull(.pred_class)
# pred_lasso

pred_ridge <- ridge_tuned_fitted |>
     extract_workflow() |>
     predict(test_tidy) |>
     pull(.pred_class)
# pred_ridge

pred_elastic_net <- elastic_net_tuned_fitted |>
     extract_workflow() |>
     predict(test_tidy) |>
     pull(.pred_class)
# pred_elastic_net

pred_random_forest <- random_forest_tuned_fitted |>
     extract_workflow() |>
     predict(test_tidy) |>
     pull(.pred_class)
# pred_random_forest

all_tested_models_predictions <- tibble(
    complaint_id = 1:20,
    lasso = pred_lasso,
    ridge = pred_ridge,
    elastic_net = pred_elastic_net,
    random_forest = pred_random_forest
)
library(kableExtra)
all_tested_models_predictions |>
    kbl(caption = "Predicted products for the test set (tested tuned models; the best one in terms of roc-auc is highlighted)") |>
    kable_paper(full_width = F) |>
    column_spec(which.max(all_tested_models_metrics$.estimate)+1,
                color = "black",
                background = "cadetblue")
Table 6.1: Predicted products for the test set (tested tuned models; the best one in terms of roc-auc is highlighted)
complaint_id lasso ridge elastic_net random_forest
1 Student loan Student loan Student loan Vehicle loan or lease
2 Vehicle loan or lease Vehicle loan or lease Vehicle loan or lease Vehicle loan or lease
3 Student loan Student loan Student loan Student loan
4 Mortgage Mortgage Mortgage Mortgage
5 Student loan Credit card or prepaid card Student loan Vehicle loan or lease
6 Credit card or prepaid card Credit card or prepaid card Credit card or prepaid card Credit card or prepaid card
7 Credit card or prepaid card Credit card or prepaid card Credit card or prepaid card Credit card or prepaid card
8 Credit card or prepaid card Credit card or prepaid card Credit card or prepaid card Credit card or prepaid card
9 Student loan Student loan Student loan Student loan
10 Vehicle loan or lease Credit card or prepaid card Vehicle loan or lease Vehicle loan or lease
11 Student loan Student loan Student loan Student loan
12 Mortgage Mortgage Mortgage Mortgage
13 Credit card or prepaid card Credit card or prepaid card Credit card or prepaid card Credit card or prepaid card
14 Credit card or prepaid card Credit card or prepaid card Credit card or prepaid card Credit card or prepaid card
15 Mortgage Mortgage Mortgage Mortgage
16 Credit card or prepaid card Credit card or prepaid card Credit card or prepaid card Credit card or prepaid card
17 Mortgage Mortgage Mortgage Mortgage
18 Student loan Student loan Student loan Student loan
19 Mortgage Mortgage Mortgage Vehicle loan or lease
20 Mortgage Mortgage Mortgage Mortgage