In this course, we have learned about modeling data in the Tidyverse in R. This project will give you the opportunity to practice those skills in greater depth.
Here, we will continue to use consumer complaints data from the Consumer Complaint Database (CFPB) that was used in the Wranging Data in the Tidyverse Course Project. The CFPB is an independent agency of the United States government that promotes transparency and protects consumers by providing information needed to make decisions when choosing financial institutions including banking institutions, lenders, mortgage services, credit unions, securities firms, foreclosure services, and debt collectors. One of the purposes of the agency is to receive and process complaints and questions about consumer financial products and services.
When a complaint is submitted by a consumer, the CFPB has to determine which category the complaint falls in (e.g. “Mortgage”, “Student loan”, etc). In this project, your goal will be to use the skills you have learned about in this course to build a classification algorithm to classify consumer complaints into one of four categories: “Credit card or prepaid card”, “Mortgage”, “Student loan”, or “Vehicle loan or lease”.
library(tidyverse)
library(glmnet)
library(tidytext)
library(tidymodels)
library(janitor)
library(textrecipes)
library(ggrepel)
library(hardhat)
complaints_train <- read_csv("data/complaints_train.csv") %>%
clean_names()
# Glimpse Data
glimpse(complaints_train)
## Rows: 90,975
## Columns: 6
## $ product <chr> "Credit card or prepaid card", "Mortgage"…
## $ consumer_complaint_narrative <chr> "I initially in writing to Chase Bank in …
## $ company <chr> "JPMORGAN CHASE & CO.", "Ditech Financial…
## $ state <chr> "CT", "GA", "IN", "MI", "MI", "FL", "WA",…
## $ zip_code <chr> "064XX", "None", "463XX", "490XX", "480XX…
## $ submitted_via <chr> "Web", "Web", "Web", "Web", "Web", "Web",…
We can see that our target variable should instead be of factor type. So let’s convert it -
complaints_train <- complaints_train %>%
mutate(product = factor(product))
head(complaints_train)
## # A tibble: 6 × 6
## product consumer_complaint_n…¹ company state zip_code submitted_via
## <fct> <chr> <chr> <chr> <chr> <chr>
## 1 Credit card or pr… "I initially in writi… JPMORG… CT 064XX Web
## 2 Mortgage "My ex husband and my… Ditech… GA None Web
## 3 Student loan "I was a student at X… Navien… IN 463XX Web
## 4 Credit card or pr… "It has come to my at… CITIBA… MI 490XX Web
## 5 Credit card or pr… "This banks new firea… CITIBA… MI 480XX Web
## 6 Credit card or pr… "I only use my walmar… SYNCHR… FL 331XX Web
## # ℹ abbreviated name: ¹consumer_complaint_narrative
Let’s rename the complaints text column, also
complaints_train <- complaints_train %>%
rename(complaint = consumer_complaint_narrative)
Now major goal of the project is to achieve data pre-processing
before diving into modeling. So let’s proceed for that using
tidytext framework.
Let’s view First 5 complaint texts in full
complaints_train %>%
head(5) %>%
pull(complaint)
## [1] "I initially in writing to Chase Bank in late XXXX about a charge that was unauthorized on XXXX from XXXX in the amount of {$790.00} after many letters and complaints they still did not do anything yet in XXXX they closed my original account and reissued a new account and sent me a letter indicating they moved this charge to fraud and it would be immediately credited however they have failed to do that and almost 8 months later it is still on my statement and I'm getting charged interest. Even after this I still use my card and PAY for my charges and have never been late however I will not pay for a charge I didn't make and refuse and Chase refuses to do anything even after many attempts and letters they write stating so. I feel this is last step before I begin a case in XXXX XXXX XXXX XXXX against them"
## [2] "My ex husband and myself had a mobile home ( Home Mortgage ) with XXXX XXXX back in XX/XX/XXXX. I went through a horrible divorce as my husband was using drugs and failing to pay the bills which led to my having to divorce him. The Mortgage was in both of our names. We only had the Mortgage for one year and they got the trailer bc of being unable to pay debt. I was not working at the time as I had just given birth to my first child in XX/XX/XXXX ( she is now XXXX of age ) as this happened years ago. \n\nNow after 21 years, XXXX XXXX XXXX XXXX filled a XXXX XXXX to the IRS and the IRS alerted me which caused me an adverse affect on my IRS taxes. \nI received a l letter from the IRS XX/XX/XXXX which stated that due to DITECH Mortgage sending a XXXX with taxable income of {$17000.00}, this could cause me to have to owe the IRS {$2900.00}. IRS Told me to contact DITECH Financial to dispute this and request that DITECH Mortgage rescind the XXXX in order ( so it is not XXXX, XXXX taxable income ). \n\nI contacted DITECH at XXXX spoke with a \" XXXX '' and I TOLD HER THAT I did not want to be recorded but she did it anyways. \nI told her that I never settled on an amount of XXXX XXXX of a debt from a mortgage that I had back in XX/XX/XXXX with DITECH nor did I ever receive a XXXX from them. I asked them to rescind the XXXX ( taxable income of {$17000.00} ) as this is a debt from \" 20 Years ago '' in which I should not be taxed for taxable income of XXXX, XXXX on a debt from 20 years ago. However, XXXX refused to rescind ( reverse ) the XXXX. \nNOTE, I have suffered with my credit from this debt for over ten years and it was removed from my debt bc it has been over the 7 to 10 year period that the debt is lawfully removed from my credit. They can't go back 20 years later AND IN XX/XX/XXXX send the IRS and myself a XXXX form to tax me {$17000.00} for a debt from over 20 years ago that WAS charged off my credit report back in the year XX/XX/XXXX!!! this is very unfair and not allowed. \nALSO< I disputing the fact that they kept putting this on and off my credit for over 20 years now and it has negatively affected my credit for way over the 7 year period. \nPOINT BEING< IF THEY HAD PLANS OF TAXING ME AT {$17000.00} then they should have taxed me 20 yrs ago ( BEFORE THE 7 TO 10 YEAR PERIOD PASSED WHICH IS NOW WAY OVER ). If they were going to send a XXXX taxable income, they should have done this YEARS AGO NOT IN XX/XX/XXXX. I AM REQUESTING A COMPLAINT SUBMITTED AS THEY HAVE UNFAIRLY AFFECTED MY CREDIT PAST THE 20 YEAR PERIOD AND IT IS VIOLATING MY RIGHTS TO BE TAXED FOR SOMETHING IN XX/XX/XXXX ON A DEBT FROM THE YEAR XX/XX/XXXX"
## [3] "I was a student at XXXX XXXX from XX/XX/XXXX-XX/XX/XXXX. I accumulated XXXX of debt in student loans. This school flat out lied to me. I was under the assumption it would have been less than half of that since I had a lot of credits that transferred, not to mention this school promised me job placement, which they did not! They told me I would be making a lot more than what I do. They gave me false information. The only thing this school did for me is put me in tons of debt and lied to me. I am a single mother of XXXX and was so excited to get my degree. Until I graduated and realize how much they lied to me. The school is now shut down. Can I get these loans forgiven due to being lied to and given false information that basically pressured me into enrolling. If I would have known that my student loans would have been this high. I would have just stayed at the XXXX college where I did my prerequisites. I feel like I fool that XXXX had done this to me."
## [4] "It has come to my attention the CITI group is actively attempting to interfere with rights guaranteed by the Constitution ( 2nd Amendment ) by manipulating or denying service to certain entities or transactions that are protected by the Constitution. I do not expect CITI group, or any other financial institution to be controlling the nation 's activities through such unpatriotic and un-American behavior. I do expect CITI group to tend to the business of their business and stay out of the business of social tyranny."
## [5] "This banks new firearm policies run counter to laws and regulations passed by Congress, and they infringe and discriminate against an individuals Second Amendment rights. Such policies, should not be endorsed by our federal government, which instead should do business with companies that respect all of our constitutional rights, including the Second Amendment. Our federal government should take all necessary steps to review and terminate its contract with Citibank unless they rescind their guidelines."
We can see that there are many text sub-strings to clean like, amount field, date field, punctuation, XXXs, numbers, etc. So let’s clean the complaint column before proceeding.
cleaned <- complaints_train %>%
# remove dates
mutate(complaint = str_remove_all(complaint, "..\\/..\\/....")) %>%
# remove all XXX...
mutate(complaint = str_remove_all(complaint, "X(X+)|x(x+)")) %>%
# amount fields
mutate(complaint = str_remove_all(complaint, "\\{\\$[0-9\\.]*\\}")) %>%
# digits, etc.
mutate(complaint = str_remove_all(complaint, "[0-9]+")) %>%
# sentence breaks
mutate(complaint = str_remove_all(complaint, "\n_*\n")) %>%
# punctuation
mutate(complaint = str_remove_all(complaint, "(_+\\s?)+"),
case_id = row_number()) %>% #create case id for future ref
drop_na() # remove na rows if any after removal of such words above
Let’s view top -5 rows again
cleaned %>%
head(5) %>%
pull(complaint)
## [1] "I initially in writing to Chase Bank in late about a charge that was unauthorized on from in the amount of after many letters and complaints they still did not do anything yet in they closed my original account and reissued a new account and sent me a letter indicating they moved this charge to fraud and it would be immediately credited however they have failed to do that and almost months later it is still on my statement and I'm getting charged interest. Even after this I still use my card and PAY for my charges and have never been late however I will not pay for a charge I didn't make and refuse and Chase refuses to do anything even after many attempts and letters they write stating so. I feel this is last step before I begin a case in against them"
## [2] "My ex husband and myself had a mobile home ( Home Mortgage ) with back in . I went through a horrible divorce as my husband was using drugs and failing to pay the bills which led to my having to divorce him. The Mortgage was in both of our names. We only had the Mortgage for one year and they got the trailer bc of being unable to pay debt. I was not working at the time as I had just given birth to my first child in ( she is now of age ) as this happened years ago. Now after years, filled a to the IRS and the IRS alerted me which caused me an adverse affect on my IRS taxes. \nI received a l letter from the IRS which stated that due to DITECH Mortgage sending a with taxable income of , this could cause me to have to owe the IRS . IRS Told me to contact DITECH Financial to dispute this and request that DITECH Mortgage rescind the in order ( so it is not , taxable income ). I contacted DITECH at spoke with a \" '' and I TOLD HER THAT I did not want to be recorded but she did it anyways. \nI told her that I never settled on an amount of of a debt from a mortgage that I had back in with DITECH nor did I ever receive a from them. I asked them to rescind the ( taxable income of ) as this is a debt from \" Years ago '' in which I should not be taxed for taxable income of , on a debt from years ago. However, refused to rescind ( reverse ) the . \nNOTE, I have suffered with my credit from this debt for over ten years and it was removed from my debt bc it has been over the to year period that the debt is lawfully removed from my credit. They can't go back years later AND IN send the IRS and myself a form to tax me for a debt from over years ago that WAS charged off my credit report back in the year !!! this is very unfair and not allowed. \nALSO< I disputing the fact that they kept putting this on and off my credit for over years now and it has negatively affected my credit for way over the year period. \nPOINT BEING< IF THEY HAD PLANS OF TAXING ME AT then they should have taxed me yrs ago ( BEFORE THE TO YEAR PERIOD PASSED WHICH IS NOW WAY OVER ). If they were going to send a taxable income, they should have done this YEARS AGO NOT IN . I AM REQUESTING A COMPLAINT SUBMITTED AS THEY HAVE UNFAIRLY AFFECTED MY CREDIT PAST THE YEAR PERIOD AND IT IS VIOLATING MY RIGHTS TO BE TAXED FOR SOMETHING IN ON A DEBT FROM THE YEAR "
## [3] "I was a student at from -. I accumulated of debt in student loans. This school flat out lied to me. I was under the assumption it would have been less than half of that since I had a lot of credits that transferred, not to mention this school promised me job placement, which they did not! They told me I would be making a lot more than what I do. They gave me false information. The only thing this school did for me is put me in tons of debt and lied to me. I am a single mother of and was so excited to get my degree. Until I graduated and realize how much they lied to me. The school is now shut down. Can I get these loans forgiven due to being lied to and given false information that basically pressured me into enrolling. If I would have known that my student loans would have been this high. I would have just stayed at the college where I did my prerequisites. I feel like I fool that had done this to me."
## [4] "It has come to my attention the CITI group is actively attempting to interfere with rights guaranteed by the Constitution ( nd Amendment ) by manipulating or denying service to certain entities or transactions that are protected by the Constitution. I do not expect CITI group, or any other financial institution to be controlling the nation 's activities through such unpatriotic and un-American behavior. I do expect CITI group to tend to the business of their business and stay out of the business of social tyranny."
## [5] "This banks new firearm policies run counter to laws and regulations passed by Congress, and they infringe and discriminate against an individuals Second Amendment rights. Such policies, should not be endorsed by our federal government, which instead should do business with companies that respect all of our constitutional rights, including the Second Amendment. Our federal government should take all necessary steps to review and terminate its contract with Citibank unless they rescind their guidelines."
It would be better, if all complaint is converted to same case i.e. lower case.
cleaned <- cleaned %>%
mutate(complaint = tolower(complaint))
Now let’s proceed for a few visualisations before cleaning the text into rows.
cleaned %>%
ggplot(aes(x = fct_rev(fct_infreq(product)))) +
geom_bar(fill = "seagreen") +
geom_label(aes(label = ..count..), stat = "count", hjust = 1.5) +
coord_flip() +
theme_bw() +
labs(x = "", y = "")
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We can notice high imbalance in each class. Now Let’s proceed for word
tokenisation. We will use the workflow of
tidytext package
as described in Julia Silge's awesome book.
unnest_words <- cleaned %>%
unnest_tokens(output = word, input = complaint) %>%
count(product, word)
# View Top-10 rows (results will be in a tibble)
unnest_words %>%
arrange(desc(n))
## # A tibble: 119,165 × 3
## product word n
## <fct> <chr> <int>
## 1 Mortgage the 441359
## 2 Credit card or prepaid card the 398873
## 3 Credit card or prepaid card i 368090
## 4 Mortgage to 358051
## 5 Mortgage i 303823
## 6 Credit card or prepaid card to 303199
## 7 Mortgage and 274417
## 8 Credit card or prepaid card and 244352
## 9 Mortgage a 190165
## 10 Credit card or prepaid card my 184459
## # ℹ 119,155 more rows
We can see lots of stop words being used. So let’s remove them using
tidytext::stop_words inbuilt data.
unnest_words <- unnest_words %>%
anti_join(stop_words, by = join_by("word"))
unnest_words %>%
arrange(desc(n))
## # A tibble: 116,365 × 3
## product word n
## <fct> <chr> <int>
## 1 Credit card or prepaid card card 91716
## 2 Credit card or prepaid card credit 86689
## 3 Mortgage mortgage 72708
## 4 Mortgage loan 72186
## 5 Credit card or prepaid card account 69050
## 6 Mortgage payment 56531
## 7 Credit card or prepaid card payment 34039
## 8 Credit card or prepaid card bank 31989
## 9 Mortgage payments 31354
## 10 Student loan loan 30970
## # ℹ 116,355 more rows
Look’s Much better. Let’s see top-10 words for each product category.
unnest_words %>%
group_by(product) %>%
slice_max(n, n= 10) %>%
ggplot(aes(reorder_within(word, n, product), n, fill = product)) +
geom_col(show.legend = FALSE) +
scale_x_reordered() +
coord_flip() +
facet_wrap(. ~ product, scales = "free_y") +
theme_bw() +
labs(x = "", y = "")
We
can see words like
payment, loan, etc.
occuring in each product category, so this method of counting frequency
is not appropriate one for modeling. We can use
Term Frequency-Inverse Document Frequency or
TF-IDF as described in above mentioned book. This will give
us a metric which will represent how often a word appears in one
category while being least frequent in other categories.
unnest_tfidf <- unnest_words %>%
bind_tf_idf(term = word,
document = product,
n = n)
unnest_tfidf %>%
arrange(-tf_idf) %>%
head(10)
## # A tibble: 10 × 6
## product word n tf idf tf_idf
## <fct> <chr> <int> <dbl> <dbl> <dbl>
## 1 Student loan navient 13434 0.0133 0.693 0.00925
## 2 Student loan fedloan 1768 0.00176 1.39 0.00243
## 3 Credit card or prepaid card amex 4402 0.00146 1.39 0.00202
## 4 Student loan pslf 2306 0.00229 0.693 0.00159
## 5 Student loan nelnet 2257 0.00224 0.693 0.00155
## 6 Mortgage phh 3763 0.00111 1.39 0.00154
## 7 Vehicle loan or lease gm 1607 0.00211 0.693 0.00146
## 8 Mortgage shellpoint 3552 0.00105 1.39 0.00145
## 9 Mortgage loancare 3357 0.000988 1.39 0.00137
## 10 Mortgage sps 3313 0.000975 1.39 0.00135
Now we can use the same chart, as above to see the word-tokens with highest tf_idf per category.
unnest_tfidf %>%
group_by(product) %>%
slice_max(tf_idf, n= 10) %>%
ggplot(aes(reorder_within(word, tf_idf, product), tf_idf, fill = product)) +
geom_col(show.legend = FALSE) +
scale_x_reordered() +
coord_flip() +
facet_wrap(. ~ product, scales = "free_y") +
theme_bw() +
labs(x = "", y = "")
Now
the data seems more meaningful. We can see some bank names in
credit card category, car company names in
vehicle loan category, etc.
We have already cleaned the data, so let’s load the data
cleaned <- cleaned %>%
select(complaint, product)
Though the data provided has already been split into train and test sets, yet out-of-sample testing is not possible as the class/outcome in test data is not available. Let’s therefore create training and test splits 75:25, first.
set.seed(200776)
initial_split_df <- cleaned %>%
initial_split(prop = 0.75, strata = product)
train_df <- training(initial_split_df)
test_df <- testing(initial_split_df)
Let’s try the Multinomial Regression with 0.01 penalty
and lasso regression thus making
mixture = 1.
my_model1 <- multinom_reg(penalty = 0.01, mixture = 1) %>%
set_engine("glmnet") %>%
set_mode("classification")
Let’s use the cleaned data for modelling.
my_recipe <- recipe(product ~ complaint, data = train_df) %>%
# Tokenize
step_tokenize(complaint) %>%
# remove plurals, etc.
step_stem(complaint) %>%
# remove stop words
step_stopwords(complaint) %>%
# filter token
step_tokenfilter(complaint, max_tokens = 500, min_times = 5) %>%
# Most IMPORTANT step
step_tfidf(complaint) %>%
# downsampling step
themis::step_downsample()
my_recipe
Let’s create the workflow with above recipe and model
lasso_workflow <- workflow() %>%
add_model(my_model1) %>%
add_recipe(my_recipe)
### Evaluate metrics
lasso_workflow %>%
last_fit(split = initial_split_df) %>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.910 Preprocessor1_Model1
## 2 roc_auc hand_till 0.981 Preprocessor1_Model1
Let’s try the same with decision tree algorithm with default parameters
## Model
my_model2 <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
## Update workflow
dt_wf <- workflow() %>%
add_recipe(my_recipe) %>%
add_model(my_model2)
## Inspect Metrics
dt_wf %>%
last_fit(split = initial_split_df) %>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.846 Preprocessor1_Model1
## 2 roc_auc hand_till 0.935 Preprocessor1_Model1
Let’s try the same with random forest algorithm also, before finalising.
rf_model <- rand_forest(mtry = 50, min_n = 25, trees = 45) %>%
set_mode("classification") %>%
set_engine("ranger")
## New WF
rf_wf <- workflow() %>%
add_recipe(my_recipe) %>%
add_model(rf_model)
## Inspect metrics
rf_wf %>%
last_fit(split = initial_split_df) %>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.934 Preprocessor1_Model1
## 2 roc_auc hand_till 0.989 Preprocessor1_Model1
As both ROC-AUC and accuracy are higher in RF model
with randomly chosen parameters mtry, min_n
and tress, we should select that model and try to further
optimize its performance by tuning some hyper-parameters.
Let’s divide the data into 5 folds for cross validation.
set.seed(200776)
my_folds <- cleaned %>%
vfold_cv(v = 5, strata = product)
Let’s do it with 5 randomly selected values. I have included random seed so that the code can be replicated.
## Tune Lasso
rf_tune <- rand_forest(
mtry = tune(),
min_n = tune(),
trees = 45
) %>%
set_mode("classification") %>%
set_engine("ranger")
## Update model (tune) in workflow
rf_wf_tune <- rf_wf %>%
update_model(rf_tune)
# parallel computing due to size of database and cross-folds
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
set.seed(2007)
tuned_rf <- tune_grid(
rf_wf_tune,
resamples = my_folds,
grid = 5
)
Let’s see the results and best model
tuned_rf %>%
show_best("accuracy")
## # A tibble: 5 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 44 36 accuracy multiclass 0.935 5 0.00102 Preprocessor1_Model3
## 2 170 20 accuracy multiclass 0.932 5 0.00117 Preprocessor1_Model1
## 3 289 8 accuracy multiclass 0.930 5 0.00126 Preprocessor1_Model5
## 4 385 32 accuracy multiclass 0.930 5 0.00146 Preprocessor1_Model4
## 5 439 13 accuracy multiclass 0.929 5 0.00131 Preprocessor1_Model2
penalty
i.e. amount of regularisationtuned_rf %>%
autoplot()
tuned_rf %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
select(mean, min_n, mtry) %>%
pivot_longer(min_n:mtry,
values_to = "value",
names_to = "parameter") %>%
ggplot(aes(value, mean, color = parameter)) +
geom_point(show.legend = FALSE) +
facet_wrap(~parameter, scales = "free_x")
best_rf <- tuned_rf %>%
select_best("roc_auc")
wf_final <- rf_wf_tune %>%
finalize_workflow(best_rf)
registerDoParallel(cl)
fit_final <- wf_final %>%
fit(cleaned)
test <- read_csv("data/test.csv") %>%
clean_names()
test_cleaned <- test %>%
# rename column
rename(complaint = consumer_complaint_narrative) %>%
select(complaint) %>%
# remove dates
mutate(complaint = str_remove_all(complaint, "..\\/..\\/....")) %>%
# remove all XXX...
mutate(complaint = str_remove_all(complaint, "X(X+)|x(x+)")) %>%
# amount fields
mutate(complaint = str_remove_all(complaint, "\\{\\$[0-9\\.]*\\}")) %>%
# digits, etc.
mutate(complaint = str_remove_all(complaint, "[0-9]+")) %>%
# sentence breaks
mutate(complaint = str_remove_all(complaint, "\n_*\n")) %>%
# punctuation
mutate(complaint = str_remove_all(complaint, "(_+\\s?)+"))
preds <- fit_final %>%
predict(test_cleaned) %>%
bind_cols(test)
preds[, 1]
## # A tibble: 20 × 1
## .pred_class
## <fct>
## 1 Student loan
## 2 Vehicle loan or lease
## 3 Student loan
## 4 Mortgage
## 5 Vehicle loan or lease
## 6 Credit card or prepaid card
## 7 Credit card or prepaid card
## 8 Credit card or prepaid card
## 9 Student loan
## 10 Vehicle loan or lease
## 11 Student loan
## 12 Mortgage
## 13 Vehicle loan or lease
## 14 Credit card or prepaid card
## 15 Mortgage
## 16 Credit card or prepaid card
## 17 Mortgage
## 18 Student loan
## 19 Vehicle loan or lease
## 20 Mortgage