library(tidyverse)
library(tidymodels)
library(here)
library(janitor)
train <- read_csv(here::here("data", "tidy_data", "data_complaints_train.csv"))
library(skimr)
#glimpse(train)
#library(visdat)
#vis_dat(train)
#vis_miss(train)
skim(train)
| Name | train |
|---|---|
| Number of rows | 90975 |
| Number of columns | 6 |
| Column type frequency: | |
| character | 6 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Product | 0 | 1 | 8 | 27 | 0 | 4 | 0 |
| Consumer complaint narrative | 0 | 1 | 12 | 32317 | 0 | 90240 | 0 |
| Company | 0 | 1 | 4 | 82 | 0 | 1497 | 0 |
| State | 0 | 1 | 2 | 36 | 0 | 62 | 0 |
| ZIP code | 0 | 1 | 4 | 5 | 0 | 5484 | 0 |
| Submitted via | 0 | 1 | 3 | 3 | 0 | 1 | 0 |
There is no missing data: all columns have a complete rate of 1. The first observations from skimming (using the skim() function from the skimr package) are the following:
The variable names should be cleaned, in order to respect the tidy philosophy for variable names. This will be done using clean_names().
The variable Submitted via has one unique value, hence it injects no information into the data and shouldn't be used as a predictor. The variable zip_code is redundant: for most observations the last two digits of the ZIP code are unavailable, so it carries no geographic granularity beyond the state, and that information is already contained in the state variable. Both Submitted via and zip_code will be dropped.
For simplicity, the variable consumer_complaint_narrative will be renamed to complaint_narrative.
train_tidy <- train |>
clean_names() |>
select(-c(submitted_via, zip_code)) |>
rename(complaint_narrative = consumer_complaint_narrative) |>
mutate(product = factor(product))
#train_tidy
For classifying the type of complaint (i.e. the variable product) based on the available potential predictors, the main interest will be the variable complaint_narrative. Since it contains the complaint text, it can be mined for words that carry information about the complaint category (product).
We will first explore the other potential predictors, company and state, in order to determine whether they can be used in the model. This will be done by investigating whether each potential predictor can inject any information with respect to the outcome variable.
We start by looking at the outcome variable. There are no missing data and we know it is a categorical variable with 4 levels.

### product outcome

In this section the outcome, i.e. the product variable, is explored. The goal is to check whether the outcome's levels are imbalanced, by looking at the distribution of the product variable.
library(viridis)
par(mar = c(4, 4, .1, .1))
train_tidy |>
count(product) |>
mutate(product_freq = n / sum(n)) |>
ggplot(aes(x = fct_reorder(product, product_freq),
y = product_freq)) +
geom_col(aes(fill = n)) +
geom_text(aes(label = scales::percent(product_freq)),
vjust=-.5, size=4, fontface = "bold") +
labs(x = "Product",
y = "Product Frequency",
fill = "Product Counts") +
scale_fill_viridis(labels = scales::unit_format(unit = "k",
scale = 1e-3)) +
theme_bw()
Figure 2.1: Percentage and counts of complaints over product levels. The counts are shown in the legend.
The data is imbalanced. However, the least represented level, Vehicle loan or lease, representing only 10.2% of the total observations, still has roughly 10k observations, so the imbalance might not be an issue for the classification. In any case, it can be addressed easily by including an additional step in the recipe and downsampling the data in order to ensure equal counts across the (four) levels. The training set will be downsampled so that the model is trained on balanced data.
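As a preview, the downsampling can be expressed as a single recipe step via step_downsample() from the themis package; a minimal sketch (the full recipe actually used for modelling is specified in the modelling section):
library(themis)
# Sketch: downsample the majority classes so that all four product levels
# end up with equal counts in the data used for fitting
recipe(product ~ ., data = train_tidy) |>
  step_downsample(product)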
### company variable

In this section the company variable is explored. There are 90,975 available observations and 1497 companies. If most companies are associated with all (or multiple) types of products, then not much information can be injected into the model using this variable, since the companies do not discriminate between products. However, if there are companies that provide exclusively one product (i.e. one specific value of product), then a new variable (feature) can be created: it codes the corresponding product value for companies associated with exclusively one product, and codes companies that handle multiple products with another level. This can be informative and improve the classification roc-auc, provided the number of observations in the data set corresponding to single-product companies is significant.
A new feature based on the company variable is created, company_profile, with five levels: four corresponding to the unique type of product a company is associated with, and a fifth indicating that the company is associated with multiple products. We look at the distribution of this new variable and at the distribution of the complaints conditional on it.
product_levels <- unique(train_tidy$product)
#product_levels
company_by_product <- train_tidy |>
group_by(product) |>
distinct(company)
#company_by_product
credit <- company_by_product |>
filter(product == product_levels[1]) |>
pull(company)
mortage <- company_by_product |>
filter(product == product_levels[2]) |>
pull(company)
student <- company_by_product |>
filter(product == product_levels[3]) |>
pull(company)
vehicle <- company_by_product |>
filter(product == product_levels[4]) |>
pull(company)
only_credit_companies <- list(credit, mortage, student, vehicle) |>
reduce(setdiff)
only_mortage_companies <- list(mortage, credit, student, vehicle) |>
reduce(setdiff)
only_student_companies <- list(student, credit, mortage, vehicle) |>
reduce(setdiff)
only_vehicle_companies <- list(vehicle, credit, mortage, student) |>
reduce(setdiff)
Getting the number of companies associated exclusively with one type (level) of product:
list(only_credit_companies, only_mortage_companies,
only_student_companies, only_vehicle_companies) |>
map(length) |>
unlist() |>
sum()
## [1] 1293
Among the 1497 unique values of the company variable, 1293 are associated exclusively with one type of product (a unique value of the product variable), i.e. roughly 86% of the companies handle only one type of product. Hence the idea of using a new feature based on this variable in order to inject information and increase the classification roc-auc seems reasonable. However, this depends on the number of complaints (observations) in the data set that correspond to companies handling exclusively one product.
train_tidy <- train_tidy |>
mutate(company_profile = case_when(company %in% only_credit_companies ~ "credit",
company %in% only_mortage_companies ~ "mortage",
company %in% only_student_companies ~ "student",
company %in% only_vehicle_companies ~ "vehicle",
TRUE ~ "multiple") |>
factor())
#train_tidy
We look at the distribution of companies and at the distribution of complaints over the five levels of company_profile.
par(mar = c(4, 4, .1, .1))
# Bar plot number of companies per company profile
train_tidy |>
group_by(company_profile) |>
summarise(number_of_companies = length(unique(company))) |>
mutate(frequency = number_of_companies / sum(number_of_companies)) |>
ggplot(aes(x = fct_reorder(company_profile, frequency),
y = frequency)) +
geom_col(aes(fill = number_of_companies)) +
geom_text(aes(label = scales::percent(frequency)),
vjust = -.5, size = 4, fontface = "bold") +
labs(x = "Product Category",
y = "Companies Frequency",
fill = "Company Counts") +
scale_fill_viridis() +
theme_bw()
# Bar plot number of complaints (observations) per company profile
train_tidy |>
count(company_profile) |>
mutate(company_profile_freq = n / sum(n)) |>
ggplot(aes(x = fct_reorder(company_profile, company_profile_freq),
y = company_profile_freq)) +
geom_col(aes(fill = n)) +
geom_text(aes(label = scales::percent(company_profile_freq)),
vjust = -.5, size = 4, fontface = "bold") +
labs(x = "Companies profile",
y = "Complaints Frequency",
fill = "Complaints Counts") +
scale_fill_viridis(labels = scales::unit_format(unit = "k",
scale = 1e-3)) +
theme_bw()
Figure 2.2: Percentage of companies (left) and complaints (right) by company profile. The counts are shown in the corresponding legends.
The bar plots show that only 13.6% of the companies provide multiple types of products, but these companies are associated with more than 80% of the complaints. The new feature company_profile - created from the company variable - might therefore inject some amount of information into the classification model and improve the roc-auc, hence:
- company_profile will be kept as a model predictor.
- company is dropped.

train_tidy <- train_tidy |>
  select(-c(company))
### state variable

The state variable has 62 levels, corresponding to the abbreviations of the 50 states, together with DC (District of Columbia), PR (Puerto Rico), GU (Guam), VI (Virgin Islands), AS (American Samoa), FM (Federated States of Micronesia), MH (Marshall Islands), UNITED STATES MINOR OUTLYING ISLANDS, AE (armed forces in Europe, the Middle East, Africa, and Canada), AP (armed forces in the Pacific), AA (armed forces in the Americas excluding Canada) and None.
We first change the value UNITED STATES MINOR OUTLYING ISLANDS to OUT in the state variable.
train_tidy <- train_tidy |>
mutate(state = case_match(state, "UNITED STATES MINOR OUTLYING ISLANDS" ~ "OUT",
.default = state))
We check if there is visible variation in the frequencies of complaints corresponding to each of the four product levels within states, which would indicate that the state variable can be informative for the model. The frequencies are not relevant if the absolute number of complaints corresponding to a state is small. This is expected to be the case for the abbreviations corresponding to the territories outside the 50 states and DC, so first we check the number of complaints corresponding to these territories.
outer_territories <- setdiff(unique(train_tidy$state), c(state.abb, "DC"))
train_tidy |>
filter(state %in% outer_territories) |>
pull(state) |>
table() |>
t() |>
knitr::kable(caption = "Number of observations per outer territories")
| AA | AE | AP | AS | FM | GU | MH | None | OUT | PR | VI |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 45 | 30 | 2 | 8 | 11 | 1 | 349 | 15 | 192 | 11 |
Setting the threshold at 50, we remove the values AA, AE, AP, AS, FM, GU, MH and VI.
train_tidy |>
filter(!state %in% c("AA", "AE", "AP", "AS", "FM", "GU", "MH", "VI")) |>
group_by(state, product) |>
summarise(counts = n()) |>
mutate(freq = counts/sum(counts)) |>
ungroup() |>
complete(state, product, fill = list(counts = 0, freq = 0)) |>
ggplot(aes(x = state, y = freq, color = product)) +
geom_point()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
legend.position="top",
axis.text.x = element_text(angle = 60, vjust = 0.5, hjust=1)) +
labs(x = "States",
y = "Frequency")
Figure 2.3: Frequency of the outcome (complaint) levels for each state (only states/regions with more than 50 complaints).
The frequencies of the four complaint levels seem to be fairly constant across states, hence:

- the state variable will be dropped.

train_tidy <- train_tidy |>
  select(-state)
### complaint_narrative variable

In this section the complaint_narrative variable is explored. This variable is clearly the most informative: text mining can be used to search for words predictive of each of the four levels of the product variable, and new informative predictor(s) can be created from them.
We check the first two observations corresponding to the complaint_narrative variable.
train_tidy |>
head(2) |>
pull(complaint_narrative)
## [1] "I initially in writing to Chase Bank in late XXXX about a charge that was unauthorized on XXXX from XXXX in the amount of {$790.00} after many letters and complaints they still did not do anything yet in XXXX they closed my original account and reissued a new account and sent me a letter indicating they moved this charge to fraud and it would be immediately credited however they have failed to do that and almost 8 months later it is still on my statement and I'm getting charged interest. Even after this I still use my card and PAY for my charges and have never been late however I will not pay for a charge I didn't make and refuse and Chase refuses to do anything even after many attempts and letters they write stating so. I feel this is last step before I begin a case in XXXX XXXX XXXX XXXX against them"
## [2] "My ex husband and myself had a mobile home ( Home Mortgage ) with XXXX XXXX back in XX/XX/XXXX. I went through a horrible divorce as my husband was using drugs and failing to pay the bills which led to my having to divorce him. The Mortgage was in both of our names. We only had the Mortgage for one year and they got the trailer bc of being unable to pay debt. I was not working at the time as I had just given birth to my first child in XX/XX/XXXX ( she is now XXXX of age ) as this happened years ago. \n\nNow after 21 years, XXXX XXXX XXXX XXXX filled a XXXX XXXX to the IRS and the IRS alerted me which caused me an adverse affect on my IRS taxes. \nI received a l letter from the IRS XX/XX/XXXX which stated that due to DITECH Mortgage sending a XXXX with taxable income of {$17000.00}, this could cause me to have to owe the IRS {$2900.00}. IRS Told me to contact DITECH Financial to dispute this and request that DITECH Mortgage rescind the XXXX in order ( so it is not XXXX, XXXX taxable income ). \n\nI contacted DITECH at XXXX spoke with a \" XXXX '' and I TOLD HER THAT I did not want to be recorded but she did it anyways. \nI told her that I never settled on an amount of XXXX XXXX of a debt from a mortgage that I had back in XX/XX/XXXX with DITECH nor did I ever receive a XXXX from them. I asked them to rescind the XXXX ( taxable income of {$17000.00} ) as this is a debt from \" 20 Years ago '' in which I should not be taxed for taxable income of XXXX, XXXX on a debt from 20 years ago. However, XXXX refused to rescind ( reverse ) the XXXX. \nNOTE, I have suffered with my credit from this debt for over ten years and it was removed from my debt bc it has been over the 7 to 10 year period that the debt is lawfully removed from my credit. They can't go back 20 years later AND IN XX/XX/XXXX send the IRS and myself a XXXX form to tax me {$17000.00} for a debt from over 20 years ago that WAS charged off my credit report back in the year XX/XX/XXXX!!! this is very unfair and not allowed. \nALSO< I disputing the fact that they kept putting this on and off my credit for over 20 years now and it has negatively affected my credit for way over the 7 year period. \nPOINT BEING< IF THEY HAD PLANS OF TAXING ME AT {$17000.00} then they should have taxed me 20 yrs ago ( BEFORE THE 7 TO 10 YEAR PERIOD PASSED WHICH IS NOW WAY OVER ). If they were going to send a XXXX taxable income, they should have done this YEARS AGO NOT IN XX/XX/XXXX. I AM REQUESTING A COMPLAINT SUBMITTED AS THEY HAVE UNFAIRLY AFFECTED MY CREDIT PAST THE 20 YEAR PERIOD AND IT IS VIOLATING MY RIGHTS TO BE TAXED FOR SOMETHING IN XX/XX/XXXX ON A DEBT FROM THE YEAR XX/XX/XXXX"
The text “XXXX” (masked content) appears multiple times in most of the complaints; it will be removed. There are also several patterns that can be removed without losing information: masked dates, dollar amounts, digits and line breaks. Everything is transformed to lower case.
train_tidy <- train_tidy |>
  # remove runs of the masking characters "XX...", "xx..."
  mutate(complaint_narrative = str_remove_all(complaint_narrative,
                                              "X(X+)|x(x+)")) |>
  # remove (masked) dates of the form ../../....
  mutate(complaint_narrative = str_remove_all(complaint_narrative,
                                              "..\\/..\\/....")) |>
  # remove dollar amounts of the form {$123.00}
  mutate(complaint_narrative = str_remove_all(complaint_narrative,
                                              "\\{\\$[0-9\\.]*\\}")) |>
  # remove line breaks (possibly separated by underscores)
  mutate(complaint_narrative = str_remove_all(complaint_narrative,
                                              "\n_*\n")) |>
  # remove runs of underscores
  mutate(complaint_narrative = str_remove_all(complaint_narrative,
                                              "(_+\\s?)+")) |>
  # remove digits
  mutate(complaint_narrative = str_remove_all(complaint_narrative,
                                              "[0-9]+")) |>
  # lower case everything
  mutate(complaint_narrative = tolower(complaint_narrative))
train_tidy |>
head(2) |>
pull(complaint_narrative)
## [1] "i initially in writing to chase bank in late about a charge that was unauthorized on from in the amount of after many letters and complaints they still did not do anything yet in they closed my original account and reissued a new account and sent me a letter indicating they moved this charge to fraud and it would be immediately credited however they have failed to do that and almost months later it is still on my statement and i'm getting charged interest. even after this i still use my card and pay for my charges and have never been late however i will not pay for a charge i didn't make and refuse and chase refuses to do anything even after many attempts and letters they write stating so. i feel this is last step before i begin a case in against them"
## [2] "my ex husband and myself had a mobile home ( home mortgage ) with back in //. i went through a horrible divorce as my husband was using drugs and failing to pay the bills which led to my having to divorce him. the mortgage was in both of our names. we only had the mortgage for one year and they got the trailer bc of being unable to pay debt. i was not working at the time as i had just given birth to my first child in // ( she is now of age ) as this happened years ago. now after years, filled a to the irs and the irs alerted me which caused me an adverse affect on my irs taxes. \ni received a l letter from the irs // which stated that due to ditech mortgage sending a with taxable income of , this could cause me to have to owe the irs . irs told me to contact ditech financial to dispute this and request that ditech mortgage rescind the in order ( so it is not , taxable income ). i contacted ditech at spoke with a \" '' and i told her that i did not want to be recorded but she did it anyways. \ni told her that i never settled on an amount of of a debt from a mortgage that i had back in // with ditech nor did i ever receive a from them. i asked them to rescind the ( taxable income of ) as this is a debt from \" years ago '' in which i should not be taxed for taxable income of , on a debt from years ago. however, refused to rescind ( reverse ) the . \nnote, i have suffered with my credit from this debt for over ten years and it was removed from my debt bc it has been over the to year period that the debt is lawfully removed from my credit. they can't go back years later and in // send the irs and myself a form to tax me for a debt from over years ago that was charged off my credit report back in the year //!!! this is very unfair and not allowed. \nalso< i disputing the fact that they kept putting this on and off my credit for over years now and it has negatively affected my credit for way over the year period. \npoint being< if they had plans of taxing me at then they should have taxed me yrs ago ( before the to year period passed which is now way over ). if they were going to send a taxable income, they should have done this years ago not in //. i am requesting a complaint submitted as they have unfairly affected my credit past the year period and it is violating my rights to be taxed for something in // on a debt from the year //"
We tokenize the data, count and sort the words, and remove the stop words.
library(tidytext)
data(stop_words)
train_tidy_tok <- train_tidy |>
unnest_tokens(word,
complaint_narrative) |>
count(product, word,
sort = TRUE) |>
anti_join(stop_words,
by = join_by(word))
train_tidy_tok |>
head(5) |>
knitr::kable(caption = "Tokenized words (word) in the training set sorted by their counts (n) together with the corresponding level in the outcome variable.")
| product | word | n |
|---|---|---|
| Credit card or prepaid card | card | 91717 |
| Credit card or prepaid card | credit | 86691 |
| Mortgage | mortgage | 72709 |
| Mortgage | loan | 72184 |
| Credit card or prepaid card | account | 69050 |
The distribution of the 20 most frequent words contained in the complaints, for each product level:
train_tidy_tok |>
group_by(product) |>
slice_max(n, n = 20) |>
ggplot(aes(x = reorder_within(word, n, product),
y = n,
fill = product)) +
geom_col(show.legend = FALSE) +
scale_x_reordered() +
coord_flip() +
facet_wrap(. ~ product, scales = "free_y") +
theme_bw() +
labs(x = "", y = "")
Figure 2.4: Words in complaints sorted by their frequency, conditional on the product level. There is a visible overlap between the levels in the outcome variable.
We notice a significant overlap between the most frequent words across the four product levels. For instance, among the 20 most frequent words of each of the four product levels, 8 words are common to all.
train_tidy_tok |>
group_by(product) |>
slice_max(n, n = 20) |>
summarise(frequent_words = unique(list(word))) |>
pull(frequent_words) |>
reduce(intersect) |>
t() |>
knitr::kable(caption = "Common words between the most 20 frequent words in the complaint four levels ")
| account | payment | told | called | received | time | due | pay |
And among the 50 most frequent words of each of the four product levels, 23 words are common to all.
train_tidy_tok |>
group_by(product) |>
slice_max(n, n = 50) |>
summarise(frequent_words = unique(list(word))) |>
pull(frequent_words) |>
reduce(intersect) |>
length()
## [1] 23
Creating features that use raw word frequency as the measure of importance won't inject much useful information, given the significant overlap between the classes. Instead, we use the term frequency–inverse document frequency (tf-idf) as the measure of word importance in the complaints.
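For reference, treating each product level as one "document" (the document argument of bind_tf_idf below), tf-idf is computed as

$$
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\lvert \{ d : t \in d \} \rvert}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
$$

where $n_{t,d}$ is the count of word $t$ in document $d$ and $N = 4$ is the number of documents (product levels). A word appearing in only one level gets $\mathrm{idf} = \ln 4 \approx 1.386$, in two levels $\ln 2 \approx 0.693$, and in all four levels $0$; words shared by all levels are therefore automatically discounted (these idf values are visible in the table below).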
train_tfidf <- train_tidy_tok |>
bind_tf_idf(term = word,
document = product,
n = n)
train_tfidf |>
arrange(desc(tf_idf)) |>
head(10) |>
knitr::kable(caption = "Sorted tokenized words in the training set")
| product | word | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| Student loan | navient | 13432 | 0.0133355 | 0.6931472 | 0.0092435 |
| Student loan | fedloan | 1769 | 0.0017563 | 1.3862944 | 0.0024347 |
| Credit card or prepaid card | amex | 4401 | 0.0014585 | 1.3862944 | 0.0020219 |
| Student loan | pslf | 2303 | 0.0022865 | 0.6931472 | 0.0015848 |
| Student loan | nelnet | 2258 | 0.0022418 | 0.6931472 | 0.0015539 |
| Mortgage | phh | 3767 | 0.0011085 | 1.3862944 | 0.0015367 |
| Vehicle loan or lease | gm | 1608 | 0.0021144 | 0.6931472 | 0.0014656 |
| Mortgage | shellpoint | 3552 | 0.0010452 | 1.3862944 | 0.0014490 |
| Mortgage | loancare | 3359 | 0.0009885 | 1.3862944 | 0.0013703 |
| Mortgage | sps | 3314 | 0.0009752 | 1.3862944 | 0.0013519 |
train_tfidf |>
group_by(product) |>
slice_max(tf_idf, n = 20) |>
ggplot(aes(x = reorder_within(word, tf_idf, product),
y = tf_idf,
fill = product)) +
geom_col(show.legend = FALSE) +
scale_x_reordered() +
coord_flip() +
facet_wrap(. ~ product, scales = "free_y") +
theme_bw() +
labs(x = "", y = "")
Figure 2.5: Words in complaints sorted by tf-idf, conditional on the product level, with no overlap between outcome levels.
In this section multiple classification models are tested: multiclass logistic regression models (ridge, lasso, elastic net) and random forests.
For all model classes considered (e.g. ridge, lasso, random forests), the parameters are tuned using a grid search. The number of candidate parameters is set to a constant value for tuning all considered models.
number_of_candidate_param <- 20
For each model class (e.g. ridge, lasso, random forests) the model selection is done using k-fold cross-validation. The number of partitions (folds) is set to a constant value for model selection for all considered models.
number_of_partitions <- 20
In order to compare the considered models, the same tuning and selection procedure is applied to each. In the first section the testing data is transformed using the same procedure used for the training data (i.e. cleaning the variable names, dropping all the columns that are not predictive and won't be included in the model, and adding the company_profile feature). The second section shows the recipe used during the simulations. The next sections are dedicated to the different models that are tested. In the last section, all the tested models are presented together with their associated roc-auc scores, corresponding to the tuned parameters found via the grid search and selected via cross-validation.
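For reference, the roc-auc reported throughout (the hand_till estimator, yardstick's default for multiclass outcomes) is the Hand and Till (2001) generalization of the AUC, the average over all pairwise class AUCs:

$$
M = \frac{2}{c(c-1)} \sum_{i < j} \hat{A}(c_i, c_j), \qquad
\hat{A}(c_i, c_j) = \frac{\hat{A}(c_i \mid c_j) + \hat{A}(c_j \mid c_i)}{2},
$$

where $c = 4$ is the number of classes and $\hat{A}(c_i \mid c_j)$ is the probability that a randomly chosen member of class $c_i$ receives a higher predicted probability of belonging to $c_i$ than a randomly chosen member of class $c_j$.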
In the exploratory data analysis we established that:

- the outcome product has to be transformed into a factor and the recipe should include a downsampling step in order to account for the imbalance in the data, see Figure 2.1.
- the submitted_via variable can't inject any information (having a unique value), hence it is dropped.
- the zip_code variable does not inject any information, since it effectively contains the same information as the state variable, hence it is dropped. (This is due to the fact that for most observations the zip_code has the last two digits unavailable.)
- the state variable doesn't seem predictive of the outcome levels, since each outcome level has a fairly constant frequency across states (i.e. each outcome level is uniformly distributed across the states), see Figure 2.3. It is dropped as a potential predictor.
- the company variable is dropped.
- a new (categorical) feature derived from company, with five levels, company_profile, is retained. How the complaints are distributed over the company_profile levels can be seen in Figure 2.2.
- the complaint_narrative variable is retained and used as the main predictor; features are derived from it via the term frequency–inverse document frequency measure.
The number of columns in the training set, after dropping the columns that are not predictive, is 3:

- product (the outcome),
- complaint_narrative (predictor via tf-idf),
- company_profile (predictor via one-hot encoding).

train_tidy |>
  head(1) |>
  knitr::kable(caption = "Train set")
| product | complaint_narrative | company_profile |
|---|---|---|
| Credit card or prepaid card | i initially in writing to chase bank in late about a charge that was unauthorized on from in the amount of after many letters and complaints they still did not do anything yet in they closed my original account and reissued a new account and sent me a letter indicating they moved this charge to fraud and it would be immediately credited however they have failed to do that and almost months later it is still on my statement and i’m getting charged interest. even after this i still use my card and pay for my charges and have never been late however i will not pay for a charge i didn’t make and refuse and chase refuses to do anything even after many attempts and letters they write stating so. i feel this is last step before i begin a case in against them | multiple |
Both predictors are used in the model indirectly, i.e. in the recipe new features will be constructed from complaint_narrative using tf-idf and from company_profile using one-hot encoding.
The same transformations (i.e. cleaning the names, dropping submitted_via, zip_code, company and state, and adding company_profile) are applied to the testing set.
test <- read_csv(here::here("data", "tidy_data", "data_complaints_test.csv"))
#head(test)
test_tidy <- test |>
clean_names() |>
select(-c(problem_id, submitted_via, zip_code)) |>
rename(complaint_narrative = consumer_complaint_narrative)
test_tidy <- test_tidy |>
mutate(company_profile = case_when(company %in% only_credit_companies ~ "credit",
company %in% only_mortage_companies ~ "mortage",
company %in% only_student_companies ~ "student",
company %in% only_vehicle_companies ~ "vehicle",
TRUE ~ "multiple") |>
factor())
test_tidy <- test_tidy |>
  select(-c(company, state))
test_tidy |>
head(1) |>
knitr::kable(caption = "Test set")
| complaint_narrative | company_profile |
|---|---|
| I have multiple lateness/missed payments on my credit report due to ECMC not properly contacting me to inform me when payments were due. I never received any statements in regards to my loan. It was also a period in time when my loan was deferred so there should not be any late/ missed payments that is being reporting on my credit file. | multiple |
We split the data via initial_split, extract the training and testing data, and also prepare data_vfold for cross-validation, using number_of_partitions (= 20) folds.
set.seed(42)
data_split <- initial_split(train_tidy, prop = 2/3)
data_split
## <Training/Testing/Total>
## <60650/30325/90975>
data_train <- training(data_split)
data_test <- testing(data_split)
data_vfold <- rsample::vfold_cv(data = data_train,
v = number_of_partitions)
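Since the split is not stratified, one can optionally verify that the product proportions carried over to both parts of the split (a quick check, not part of the modelling pipeline). Note that data_test is the assessment part of the split of the training data, distinct from the test_tidy set read in earlier.
# Optional check: class proportions in the analysis and assessment sets
data_train |>
  count(product) |>
  mutate(prop = n / sum(n))
data_test |>
  count(product) |>
  mutate(prop = n / sum(n))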
As mentioned in the EDA section, two predictors are used:

- the complaint_narrative variable,
- the company_profile variable.

The complaint_narrative variable is first tokenized (using step_tokenize from textrecipes), the stop words are removed, and new features are then created using the term frequency–inverse document frequency.
library(textrecipes)
library(themis)
# The two-predictor recipe, using both complaint_narrative and company_profile
rec <- data_train |>
recipe(product ~ .) |>
# One hot encoding for the factor variable
step_dummy(company_profile,
one_hot = TRUE) |>
# tokenization
step_tokenize(complaint_narrative) |>
# stemming (chopping off the end of words)
step_stem(complaint_narrative) |>
# removing the stop words
step_stopwords(complaint_narrative) |>
  # keep only the 300 most important tokens
  step_tokenfilter(complaint_narrative,
                   max_tokens = 300) |>
  # term frequency–inverse document frequency
  step_tfidf(complaint_narrative) |>
  # downsampling the outcome in order to address the imbalance in levels
  step_downsample(product)
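As an optional sanity check (not required for the tuning below), the recipe can be prepped and baked on the training data in order to inspect the engineered features, i.e. the one-hot company_profile dummies and the 300 tfidf_complaint_narrative_* columns; note this can be slow on the full training set:
# Optional: apply the recipe to the training data and inspect the result
rec |>
  prep() |>
  bake(new_data = NULL) |>
  glimpse()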
In this section the classification is done using multinomial regression. We will consider multiple options: ridge regression, lasso and elastic net.
All models use the recipe above, i.e. the tf-idf features derived from the main predictor, complaint_narrative, together with the one-hot encoded company_profile, with downsampling applied so that each model is trained on balanced data.
In this section the lasso model is considered. The penalty term (lambda) is tuned using a grid search with number_of_candidate_param candidate parameters (i.e. grid = number_of_candidate_param) over number_of_partitions-fold cross-validation (i.e. v = number_of_partitions in vfold_cv).
Creation of the model:
library(glmnet)
library(doParallel)
# Creating the lasso model with tuning for the regularization parameter
lasso_model_tune <- multinom_reg(mode = "classification",
engine = "glmnet",
penalty = tune(), # regularization term
mixture = 1 # LASSO
)
Workflow, tuning, cross-validation, metrics, best roc-auc:
# Creating the workflow: lasso model and the recipe defined above
lasso_wflow <- workflow() |>
  workflows::add_recipe(rec) |>
  workflows::add_model(lasso_model_tune)
# Parallel computing for the grid search procedure
doParallel::registerDoParallel(parallel::detectCores())
# Setting a seed for reproducibility
set.seed(42)
# Tuning results for the candidate parameters
lasso_params <- tune_grid(lasso_wflow,
resamples = data_vfold,
grid = number_of_candidate_param
)
# Showing the best results
lasso_params |>
show_best(metric = "roc_auc",
n = 5) |>
knitr::kable(caption = "Metrics for canditates one predictor lasso model")
| penalty | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|
| 0.0000866 | roc_auc | hand_till | 0.9896293 | 20 | 0.0003081 | Preprocessor1_Model12 |
| 0.0002082 | roc_auc | hand_till | 0.9896286 | 20 | 0.0003068 | Preprocessor1_Model13 |
| 0.0000000 | roc_auc | hand_till | 0.9896207 | 20 | 0.0003053 | Preprocessor1_Model01 |
| 0.0000000 | roc_auc | hand_till | 0.9896207 | 20 | 0.0003053 | Preprocessor1_Model02 |
| 0.0000000 | roc_auc | hand_till | 0.9896207 | 20 | 0.0003053 | Preprocessor1_Model03 |
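The roc-auc profile across the candidate penalty values can also be inspected visually; an optional check using the autoplot() method for tuning results:
# Optional: visualize the roc-auc across the candidate penalty values
autoplot(lasso_params, metric = "roc_auc")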
# Get the best result
lasso_best_param <- lasso_params |>
select_best(metric = "roc_auc")
# Finalize the workflow with the best corresponding parameter
lasso_wflow_tuned <- lasso_wflow |>
tune::finalize_workflow(lasso_best_param)
# Fitting the tuned (best) model
lasso_tuned_fitted <- lasso_wflow_tuned |>
  last_fit(split = data_split)
# Collecting the metrics of the fit with the best tuned parameter
lasso_tuned_metrics <- lasso_tuned_fitted |>
  collect_metrics()
# Show the metrics as a kable
lasso_tuned_metrics |>
  knitr::kable(caption = "Metrics for the tuned lasso model")
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | multiclass | 0.9407420 | Preprocessor1_Model1 |
| roc_auc | hand_till | 0.9904114 | Preprocessor1_Model1 |
| brier_class | multiclass | 0.0438535 | Preprocessor1_Model1 |
all_tested_models_metrics <- lasso_tuned_metrics |>
filter(.metric == "roc_auc") |>
select(-.config) |>
mutate(model = "lasso",
name = "one_pred_lasso_tuned_fitted") |>
relocate(model)
In this section the ridge regression model is considered. The penalty term (lambda) is tuned using a grid search with number_of_candidate_param candidate parameters (i.e. grid = number_of_candidate_param) over number_of_partitions-fold cross-validation (i.e. v = number_of_partitions in vfold_cv).
Creation of the model:
# Creating the ridge model with tuning for the regularization parameter
ridge_model_tune <- multinom_reg(mode = "classification",
engine = "glmnet",
penalty = tune(), # regularization term
mixture = 0 # Ridge Regression
)
Workflow, tuning, cross-validation, metrics, best roc-auc:
# Creating the workflow: ridge model and the recipe defined above
ridge_wflow <- workflow() |>
  workflows::add_recipe(rec) |>
  workflows::add_model(ridge_model_tune)
# Parallel computing for the grid search procedure
doParallel::registerDoParallel(parallel::detectCores())
# Setting a seed for reproducibility
set.seed(42)
# Tuning results for the candidate parameters
ridge_params <- tune_grid(ridge_wflow,
resamples = data_vfold,
grid = number_of_candidate_param
)
# Showing the best results
ridge_params |>
show_best(metric = "roc_auc",
n = 5) |>
knitr::kable(caption = "Metrics for canditates one predictor ridge model")
| penalty | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|
| 0 | roc_auc | hand_till | 0.9847506 | 20 | 0.00045 | Preprocessor1_Model01 |
| 0 | roc_auc | hand_till | 0.9847506 | 20 | 0.00045 | Preprocessor1_Model02 |
| 0 | roc_auc | hand_till | 0.9847506 | 20 | 0.00045 | Preprocessor1_Model03 |
| 0 | roc_auc | hand_till | 0.9847506 | 20 | 0.00045 | Preprocessor1_Model04 |
| 0 | roc_auc | hand_till | 0.9847506 | 20 | 0.00045 | Preprocessor1_Model05 |
# Get the best result
ridge_best_param <- ridge_params |>
select_best(metric = "roc_auc")
# Finalize the workflow with the best corresponding parameter
ridge_wflow_tuned <- ridge_wflow |>
tune::finalize_workflow(ridge_best_param)
# Fitting the tuned (best) model
ridge_tuned_fitted <- ridge_wflow_tuned |>
  last_fit(split = data_split)
# Collecting the metrics of the fit with the best tuned parameter
ridge_tuned_metrics <- ridge_tuned_fitted |>
  collect_metrics()
# Show the metrics as a kable
ridge_tuned_metrics |>
  knitr::kable(caption = "Metrics for the tuned ridge model")
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | multiclass | 0.9248475 | Preprocessor1_Model1 |
| roc_auc | hand_till | 0.9849885 | Preprocessor1_Model1 |
| brier_class | multiclass | 0.0635453 | Preprocessor1_Model1 |
all_tested_models_metrics <- all_tested_models_metrics |>
add_row(ridge_tuned_metrics |>
filter(.metric == "roc_auc") |>
select(-.config) |>
mutate(model = "ridge",
name = "ridge_tuned_fitted")|>
relocate(model))
In this section the elastic net regression model is considered. The penalty term (lambda) and the mixture term are tuned using a grid search with number_of_candidate_param candidate parameters (i.e. grid = number_of_candidate_param) over number_of_partitions-fold cross-validation (i.e. v = number_of_partitions in vfold_cv).
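For reference, the glmnet objective behind all three regularized models is the penalized (multinomial) log-likelihood

$$
\min_{\beta} \; -\frac{1}{N} \, \ell(\beta) + \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right],
$$

where $\lambda$ is the penalty and $\alpha$ the mixture: $\alpha = 0$ recovers ridge regression, $\alpha = 1$ the lasso, and intermediate values the elastic net.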
Creation of the model:
library(glmnet)
library(doParallel)
# Creating the elastic net model with tuning for the regularization and mixture parameters
elastic_net_model_tune <- multinom_reg(mode = "classification",
engine = "glmnet",
penalty = tune(), # regularization term
mixture = tune() # Elastic Net
)
Workflow, tuning, cross-validation, metrics, best roc-auc:
# Creating the workflow: elastic net model and the recipe defined above
elastic_net_wflow <- workflow() |>
  workflows::add_recipe(rec) |>
  workflows::add_model(elastic_net_model_tune)
# Parallel computing for the grid search procedure
doParallel::registerDoParallel(parallel::detectCores())
# Setting a seed for reproducibility
set.seed(42)
# Tuning results for the candidate parameters
elastic_net_params <- tune_grid(elastic_net_wflow,
resamples = data_vfold,
grid = number_of_candidate_param
)
# Showing the best results
elastic_net_params |>
show_best(metric = "roc_auc",
n = 5) |>
knitr::kable(caption = "Metrics for canditates one predictor elastic-net model")
| penalty | mixture | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 0.0e+00 | 0.9732171 | roc_auc | hand_till | 0.9896174 | 20 | 0.0003056 | Preprocessor1_Model20 |
| 5.8e-06 | 0.9481793 | roc_auc | hand_till | 0.9896153 | 20 | 0.0003057 | Preprocessor1_Model19 |
| 1.0e-07 | 0.8470426 | roc_auc | hand_till | 0.9896034 | 20 | 0.0003073 | Preprocessor1_Model17 |
| 1.0e-06 | 0.7894657 | roc_auc | hand_till | 0.9895972 | 20 | 0.0003076 | Preprocessor1_Model16 |
| 0.0e+00 | 0.7249143 | roc_auc | hand_till | 0.9895895 | 20 | 0.0003080 | Preprocessor1_Model15 |
# Get the best result
elastic_net_best_param <- elastic_net_params |>
select_best(metric = "roc_auc")
# Finalize the workflow with the best corresponding parameter
elastic_net_wflow_tuned <- elastic_net_wflow |>
tune::finalize_workflow(elastic_net_best_param)
# Fitting the tuned (best) model
elastic_net_tuned_fitted <- elastic_net_wflow_tuned |>
  last_fit(split = data_split)
# Collecting the metrics of the fit with the best tuned parameter
elastic_net_tuned_metrics <- elastic_net_tuned_fitted |>
  collect_metrics()
# Show the metrics as a kable
elastic_net_tuned_metrics |>
  knitr::kable(caption = "Metrics for the tuned elastic-net model")
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | multiclass | 0.9405771 | Preprocessor1_Model1 |
| roc_auc | hand_till | 0.9904704 | Preprocessor1_Model1 |
| brier_class | multiclass | 0.0438204 | Preprocessor1_Model1 |
all_tested_models_metrics <- all_tested_models_metrics |>
add_row(elastic_net_tuned_metrics |>
filter(.metric == "roc_auc") |>
select(-.config) |>
mutate(model = "elastic-net",
name = "elastic_net_tuned_fitted")|>
relocate(model))
In this section the random forest model is considered.
The number of predictors randomly sampled at each split (mtry) and the minimum number of data points required in a node for it to be split further (min_n) are tuned using a grid search with number_of_candidate_param candidate parameters (i.e. grid = number_of_candidate_param) over number_of_partitions-fold cross-validation (i.e. v = number_of_partitions in vfold_cv). The number of trees in the ensemble is fixed.
Creation of the model:
library(ranger)
library(doParallel)
# Creating the random forest model with tuning for mtry and min_n
random_forest_model_tune <- rand_forest(mode = "classification",
engine = "ranger",
mtry = tune(),
min_n = tune(),
trees = 50)
Workflow, tuning, cross-validation, metrics, best roc-auc:
# Creating the workflow: random forest model and the recipe defined above
random_forest_wflow <- workflow() |>
  workflows::add_recipe(rec) |>
  workflows::add_model(random_forest_model_tune)
# Parallel computing for the grid search procedure
doParallel::registerDoParallel(parallel::detectCores())
# Setting a seed for reproducibility
set.seed(42)
# Tuning results for the candidate parameters
random_forest_params <- tune_grid(random_forest_wflow,
resamples = data_vfold,
grid = number_of_candidate_param
)
# Showing the best results
random_forest_params |>
show_best(metric = "roc_auc",
n = 5) |>
knitr::kable(caption = "Metrics for canditates one predictor random forest model")
| mtry | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 36 | 13 | roc_auc | hand_till | 0.9892384 | 20 | 0.0003555 | Preprocessor1_Model11 |
| 24 | 5 | roc_auc | hand_till | 0.9890225 | 20 | 0.0003567 | Preprocessor1_Model19 |
| 55 | 17 | roc_auc | hand_till | 0.9888831 | 20 | 0.0003399 | Preprocessor1_Model03 |
| 71 | 39 | roc_auc | hand_till | 0.9887300 | 20 | 0.0003965 | Preprocessor1_Model12 |
| 80 | 30 | roc_auc | hand_till | 0.9884943 | 20 | 0.0004114 | Preprocessor1_Model01 |
# Get the best result
random_forest_best_param <- random_forest_params |>
select_best(metric = "roc_auc")
# Finalize the workflow with the best corresponding parameter
random_forest_wflow_tuned <- random_forest_wflow |>
tune::finalize_workflow(random_forest_best_param)
# Fitting the tuned (best) model
random_forest_tuned_fitted <- random_forest_wflow_tuned |>
  last_fit(split = data_split)
# Collecting the metrics of the fit with the best tuned parameter
random_forest_tuned_metrics <- random_forest_tuned_fitted |>
  collect_metrics()
# Show the metrics as a kable
random_forest_tuned_metrics |>
  knitr::kable(caption = "Metrics for the tuned random forest model")
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | multiclass | 0.9399835 | Preprocessor1_Model1 |
| roc_auc | hand_till | 0.9895933 | Preprocessor1_Model1 |
| brier_class | multiclass | 0.0462004 | Preprocessor1_Model1 |
all_tested_models_metrics <- all_tested_models_metrics |>
add_row(random_forest_tuned_metrics |>
filter(.metric == "roc_auc") |>
select(-.config) |>
mutate(model = "random forests",
name = "random_forest_tuned_fitted") |>
relocate(model))
all_tested_models_metrics |>
  knitr::kable(caption = "Roc-auc for the tuned (tested) models")
| model | .metric | .estimator | .estimate | name |
|---|---|---|---|---|
| lasso | roc_auc | hand_till | 0.9904114 | lasso_tuned_fitted |
| ridge | roc_auc | hand_till | 0.9849885 | ridge_tuned_fitted |
| elastic-net | roc_auc | hand_till | 0.9904704 | elastic_net_tuned_fitted |
| random forests | roc_auc | hand_till | 0.9895933 | random_forest_tuned_fitted |
library(magrittr)
# Name of the fitted workflow with the highest roc-auc
best_model <- all_tested_models_metrics |>
  pull(name) |>
  extract2(which.max(all_tested_models_metrics$.estimate))
# Confusion matrix of the best model on the held-out part of the split
eval(as.name(best_model)) |>
  collect_predictions() |>
  conf_mat(truth = product, estimate = .pred_class) |>
  autoplot(type = "heatmap") +
  scale_fill_gradient(low = "beige", high = "cadetblue")
Figure 5.1: Confusion matrix for the elastic-net model (the best tuned model in terms of roc-auc)
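Per-class summary statistics (accuracy, macro sensitivity and specificity, etc.) can optionally be derived from the same confusion matrix:
# Optional: summary statistics derived from the confusion matrix
eval(as.name(best_model)) |>
  collect_predictions() |>
  conf_mat(truth = product, estimate = .pred_class) |>
  summary()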
pred_lasso <- lasso_tuned_fitted |>
extract_workflow() |>
predict(test_tidy) |>
pull(.pred_class)
# pred_lasso
pred_ridge <- ridge_tuned_fitted |>
extract_workflow() |>
predict(test_tidy) |>
pull(.pred_class)
# pred_ridge
pred_elastic_net <- elastic_net_tuned_fitted |>
extract_workflow() |>
predict(test_tidy) |>
pull(.pred_class)
# pred_elastic_net
pred_random_forest <- random_forest_tuned_fitted |>
extract_workflow() |>
predict(test_tidy) |>
pull(.pred_class)
# pred_random_forest
all_tested_models_predictions <- tibble(
complaint_id = 1:20,
lasso = pred_lasso,
ridge = pred_ridge,
elastic_net = pred_elastic_net,
random_forest = pred_random_forest,
)
library(kableExtra)
all_tested_models_predictions |>
kbl(caption = "Predicted products for test set (tested tuned models, best one in term in auc-roc highlighted)") |>
kable_paper(full_width = F) |>
column_spec(which.max(all_tested_models_metrics$.estimate)+1,
color = "black",
background = "cadetblue")
| complaint_id | lasso | ridge | elastic_net | random_forest |
|---|---|---|---|---|
| 1 | Student loan | Student loan | Student loan | Vehicle loan or lease |
| 2 | Vehicle loan or lease | Vehicle loan or lease | Vehicle loan or lease | Vehicle loan or lease |
| 3 | Student loan | Student loan | Student loan | Student loan |
| 4 | Mortgage | Mortgage | Mortgage | Mortgage |
| 5 | Student loan | Credit card or prepaid card | Student loan | Vehicle loan or lease |
| 6 | Credit card or prepaid card | Credit card or prepaid card | Credit card or prepaid card | Credit card or prepaid card |
| 7 | Credit card or prepaid card | Credit card or prepaid card | Credit card or prepaid card | Credit card or prepaid card |
| 8 | Credit card or prepaid card | Credit card or prepaid card | Credit card or prepaid card | Credit card or prepaid card |
| 9 | Student loan | Student loan | Student loan | Student loan |
| 10 | Vehicle loan or lease | Credit card or prepaid card | Vehicle loan or lease | Vehicle loan or lease |
| 11 | Student loan | Student loan | Student loan | Student loan |
| 12 | Mortgage | Mortgage | Mortgage | Mortgage |
| 13 | Credit card or prepaid card | Credit card or prepaid card | Credit card or prepaid card | Credit card or prepaid card |
| 14 | Credit card or prepaid card | Credit card or prepaid card | Credit card or prepaid card | Credit card or prepaid card |
| 15 | Mortgage | Mortgage | Mortgage | Mortgage |
| 16 | Credit card or prepaid card | Credit card or prepaid card | Credit card or prepaid card | Credit card or prepaid card |
| 17 | Mortgage | Mortgage | Mortgage | Mortgage |
| 18 | Student loan | Student loan | Student loan | Student loan |
| 19 | Mortgage | Mortgage | Mortgage | Vehicle loan or lease |
| 20 | Mortgage | Mortgage | Mortgage | Mortgage |