Peer-graded Assignment: Modeling Data in the Tidyverse Course Project

In this course, we have learned about modeling data in the Tidyverse in R. This project will give you the opportunity to practice those skills in greater depth.

Here, we will continue to use consumer complaints data from the Consumer Complaint Database (CFPB) that was used in the Wranging Data in the Tidyverse Course Project. The CFPB is an independent agency of the United States government that promotes transparency and protects consumers by providing information needed to make decisions when choosing financial institutions including banking institutions, lenders, mortgage services, credit unions, securities firms, foreclosure services, and debt collectors. One of the purposes of the agency is to receive and process complaints and questions about consumer financial products and services.

When a complaint is submitted by a consumer, the CFPB has to determine which category the complaint falls in (e.g. “Mortgage”, “Student loan”, etc). In this project, your goal will be to use the skills you have learned about in this course to build a classification algorithm to classify consumer complaints into one of four categories: “Credit card or prepaid card”, “Mortgage”, “Student loan”, or “Vehicle loan or lease”.

Load the packages and data

library(tidyverse)
library(glmnet)
library(tidytext)
library(tidymodels)
library(janitor)
library(textrecipes)
library(ggrepel)
library(hardhat)


complaints_train <- read_csv("data/complaints_train.csv") %>% 
  clean_names()
# Glimpse Data
glimpse(complaints_train)
## Rows: 90,975
## Columns: 6
## $ product                      <chr> "Credit card or prepaid card", "Mortgage"…
## $ consumer_complaint_narrative <chr> "I initially in writing to Chase Bank in …
## $ company                      <chr> "JPMORGAN CHASE & CO.", "Ditech Financial…
## $ state                        <chr> "CT", "GA", "IN", "MI", "MI", "FL", "WA",…
## $ zip_code                     <chr> "064XX", "None", "463XX", "490XX", "480XX…
## $ submitted_via                <chr> "Web", "Web", "Web", "Web", "Web", "Web",…

We can see that our target variable should instead be of factor type. So let’s convert it -

complaints_train <- complaints_train %>% 
  mutate(product = factor(product))

head(complaints_train)
## # A tibble: 6 × 6
##   product            consumer_complaint_n…¹ company state zip_code submitted_via
##   <fct>              <chr>                  <chr>   <chr> <chr>    <chr>        
## 1 Credit card or pr… "I initially in writi… JPMORG… CT    064XX    Web          
## 2 Mortgage           "My ex husband and my… Ditech… GA    None     Web          
## 3 Student loan       "I was a student at X… Navien… IN    463XX    Web          
## 4 Credit card or pr… "It has come to my at… CITIBA… MI    490XX    Web          
## 5 Credit card or pr… "This banks new firea… CITIBA… MI    480XX    Web          
## 6 Credit card or pr… "I only use my walmar… SYNCHR… FL    331XX    Web          
## # ℹ abbreviated name: ¹​consumer_complaint_narrative

Let’s rename the complaints text column, also

complaints_train <- complaints_train %>% 
  rename(complaint = consumer_complaint_narrative)

Now major goal of the project is to achieve data pre-processing before diving into modeling. So let’s proceed for that using tidytext framework.

Data Pre-processing

Let’s view First 5 complaint texts in full

complaints_train %>% 
  head(5) %>% 
  pull(complaint)
## [1] "I initially in writing to Chase Bank in late XXXX about a charge that was unauthorized on XXXX from XXXX in the amount of {$790.00} after many letters and complaints they still did not do anything yet in XXXX they closed my original account and reissued a new account and sent me a letter indicating they moved this charge to fraud and it would be immediately credited however they have failed to do that and almost 8 months later it is still on my statement and I'm getting charged interest. Even after this I still use my card and PAY for my charges and have never been late however I will not pay for a charge I didn't make and refuse and Chase refuses to do anything even after many attempts and letters they write stating so. I feel this is last step before I begin a case in XXXX XXXX XXXX XXXX against them"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [2] "My ex husband and myself had a mobile home ( Home Mortgage ) with XXXX XXXX back in XX/XX/XXXX. I went through a horrible divorce as my husband was using drugs and failing to pay the bills which led to my having to divorce him. The Mortgage was in both of our names. We only had the Mortgage for one year and they got the trailer bc of being unable to pay debt. I was not working at the time as I had just given birth to my first child in XX/XX/XXXX ( she is now XXXX of age ) as this happened years ago. \n\nNow after 21 years, XXXX XXXX XXXX XXXX  filled a XXXX XXXX to the IRS and the IRS alerted me which caused me an adverse affect on my IRS taxes. \nI received a l letter from the IRS XX/XX/XXXX which stated that due to DITECH Mortgage sending a XXXX with taxable income of {$17000.00}, this could cause me to have to owe the IRS {$2900.00}. IRS Told me to contact DITECH Financial to dispute this and request that DITECH Mortgage rescind the XXXX in order ( so it is not XXXX, XXXX taxable income ). \n\nI contacted DITECH at XXXX spoke with a \" XXXX '' and I TOLD HER THAT I did not want to be recorded but she did it anyways. \nI told her that I never settled on an amount of XXXX  XXXX of a debt from a mortgage that I had back in XX/XX/XXXX with DITECH nor did I ever receive a XXXX from them. I asked them to rescind the XXXX ( taxable income of {$17000.00} ) as this is a debt from \" 20 Years ago '' in which I should not be taxed for taxable income of XXXX, XXXX on a debt from 20 years ago. However, XXXX refused to rescind ( reverse ) the XXXX. \nNOTE, I have suffered with my credit from this debt for over ten years and it was removed from my debt bc it has been over the 7 to 10 year period that the debt is lawfully removed from my credit. They can't go back 20 years later AND IN XX/XX/XXXX send the IRS and myself a XXXX form to tax me {$17000.00} for a debt from over 20 years ago that WAS charged off my credit report back in the year XX/XX/XXXX!!! this is very unfair and not allowed. \nALSO< I disputing the fact that they kept putting this on and off my credit for over 20 years now and it has negatively affected my credit for way over the 7 year period. \nPOINT BEING< IF THEY HAD PLANS OF TAXING ME AT {$17000.00} then they should have taxed me 20 yrs ago ( BEFORE THE 7 TO 10 YEAR PERIOD PASSED WHICH IS NOW WAY OVER ). If they were going to send a XXXX taxable income, they should have done this YEARS AGO NOT IN XX/XX/XXXX. I AM REQUESTING A COMPLAINT SUBMITTED AS THEY HAVE UNFAIRLY AFFECTED MY CREDIT PAST THE 20 YEAR PERIOD AND IT IS VIOLATING MY RIGHTS TO BE TAXED FOR SOMETHING IN XX/XX/XXXX ON A DEBT FROM THE YEAR XX/XX/XXXX"
## [3] "I was a student at XXXX XXXX from XX/XX/XXXX-XX/XX/XXXX. I accumulated XXXX of debt in student loans. This school flat out lied to me. I was under the assumption it would have been less than half of that since I had a lot of credits that transferred, not to mention this school promised me job placement, which they did not! They told me I would be making a lot more than what I do. They gave me false information. The only thing this school did for me is put me in tons of debt and lied to me. I am a single mother of XXXX  and was so excited to get my degree. Until I graduated and realize how much they lied to me. The school is now shut down. Can I get these loans forgiven due to being lied to and given false information that basically pressured me into enrolling. If I would have known that my student loans would have been this high. I would have just stayed at the XXXX college where I did my prerequisites. I feel like I fool that XXXX had done this to me."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [4] "It has come to my attention the CITI group is actively attempting to interfere with rights guaranteed by the Constitution ( 2nd Amendment ) by manipulating or denying service to certain entities or transactions that are protected by the Constitution. I do not expect CITI group, or any other financial institution to be controlling the nation 's activities through such unpatriotic and un-American behavior. I do expect CITI group to tend to the business of their business and stay out of the business of social tyranny."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [5] "This banks new firearm policies run counter to laws and regulations passed by Congress, and they infringe and discriminate against an individuals Second Amendment rights. Such policies, should not be endorsed by our federal government, which instead should do business with companies that respect all of our constitutional rights, including the Second Amendment. Our federal government should take all necessary steps to review and terminate its contract with Citibank unless they rescind their guidelines."

We can see that there are many text sub-strings to clean like, amount field, date field, punctuation, XXXs, numbers, etc. So let’s clean the complaint column before proceeding.

cleaned <- complaints_train %>% 
  # remove dates
  mutate(complaint = str_remove_all(complaint, "..\\/..\\/....")) %>% 
  # remove all XXX...
  mutate(complaint = str_remove_all(complaint, "X(X+)|x(x+)")) %>% 
  # amount fields
  mutate(complaint = str_remove_all(complaint, "\\{\\$[0-9\\.]*\\}")) %>% 
  # digits, etc.
  mutate(complaint = str_remove_all(complaint, "[0-9]+")) %>% 
  # sentence breaks
  mutate(complaint = str_remove_all(complaint, "\n_*\n")) %>% 
  # punctuation
  mutate(complaint = str_remove_all(complaint, "(_+\\s?)+"),
         case_id = row_number()) %>% #create case id for future ref
  drop_na() # remove na rows if any after removal of such words above

Let’s view top -5 rows again

cleaned %>% 
  head(5) %>% 
  pull(complaint)
## [1] "I initially in writing to Chase Bank in late  about a charge that was unauthorized on  from  in the amount of  after many letters and complaints they still did not do anything yet in  they closed my original account and reissued a new account and sent me a letter indicating they moved this charge to fraud and it would be immediately credited however they have failed to do that and almost  months later it is still on my statement and I'm getting charged interest. Even after this I still use my card and PAY for my charges and have never been late however I will not pay for a charge I didn't make and refuse and Chase refuses to do anything even after many attempts and letters they write stating so. I feel this is last step before I begin a case in     against them"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [2] "My ex husband and myself had a mobile home ( Home Mortgage ) with   back in . I went through a horrible divorce as my husband was using drugs and failing to pay the bills which led to my having to divorce him. The Mortgage was in both of our names. We only had the Mortgage for one year and they got the trailer bc of being unable to pay debt. I was not working at the time as I had just given birth to my first child in  ( she is now  of age ) as this happened years ago. Now after  years,      filled a   to the IRS and the IRS alerted me which caused me an adverse affect on my IRS taxes. \nI received a l letter from the IRS  which stated that due to DITECH Mortgage sending a  with taxable income of , this could cause me to have to owe the IRS . IRS Told me to contact DITECH Financial to dispute this and request that DITECH Mortgage rescind the  in order ( so it is not ,  taxable income ). I contacted DITECH at  spoke with a \"  '' and I TOLD HER THAT I did not want to be recorded but she did it anyways. \nI told her that I never settled on an amount of    of a debt from a mortgage that I had back in  with DITECH nor did I ever receive a  from them. I asked them to rescind the  ( taxable income of  ) as this is a debt from \"  Years ago '' in which I should not be taxed for taxable income of ,  on a debt from  years ago. However,  refused to rescind ( reverse ) the . \nNOTE, I have suffered with my credit from this debt for over ten years and it was removed from my debt bc it has been over the  to  year period that the debt is lawfully removed from my credit. They can't go back  years later AND IN  send the IRS and myself a  form to tax me  for a debt from over  years ago that WAS charged off my credit report back in the year !!! this is very unfair and not allowed. \nALSO< I disputing the fact that they kept putting this on and off my credit for over  years now and it has negatively affected my credit for way over the  year period. \nPOINT BEING< IF THEY HAD PLANS OF TAXING ME AT  then they should have taxed me  yrs ago ( BEFORE THE  TO  YEAR PERIOD PASSED WHICH IS NOW WAY OVER ). If they were going to send a  taxable income, they should have done this YEARS AGO NOT IN . I AM REQUESTING A COMPLAINT SUBMITTED AS THEY HAVE UNFAIRLY AFFECTED MY CREDIT PAST THE  YEAR PERIOD AND IT IS VIOLATING MY RIGHTS TO BE TAXED FOR SOMETHING IN  ON A DEBT FROM THE YEAR "
## [3] "I was a student at   from -. I accumulated  of debt in student loans. This school flat out lied to me. I was under the assumption it would have been less than half of that since I had a lot of credits that transferred, not to mention this school promised me job placement, which they did not! They told me I would be making a lot more than what I do. They gave me false information. The only thing this school did for me is put me in tons of debt and lied to me. I am a single mother of   and was so excited to get my degree. Until I graduated and realize how much they lied to me. The school is now shut down. Can I get these loans forgiven due to being lied to and given false information that basically pressured me into enrolling. If I would have known that my student loans would have been this high. I would have just stayed at the  college where I did my prerequisites. I feel like I fool that  had done this to me."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [4] "It has come to my attention the CITI group is actively attempting to interfere with rights guaranteed by the Constitution ( nd Amendment ) by manipulating or denying service to certain entities or transactions that are protected by the Constitution. I do not expect CITI group, or any other financial institution to be controlling the nation 's activities through such unpatriotic and un-American behavior. I do expect CITI group to tend to the business of their business and stay out of the business of social tyranny."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [5] "This banks new firearm policies run counter to laws and regulations passed by Congress, and they infringe and discriminate against an individuals Second Amendment rights. Such policies, should not be endorsed by our federal government, which instead should do business with companies that respect all of our constitutional rights, including the Second Amendment. Our federal government should take all necessary steps to review and terminate its contract with Citibank unless they rescind their guidelines."

It would be better, if all complaint is converted to same case i.e. lower case.

cleaned <- cleaned %>% 
  mutate(complaint = tolower(complaint))

Now let’s proceed for a few visualisations before cleaning the text into rows.

cleaned %>% 
  ggplot(aes(x = fct_rev(fct_infreq(product)))) +
  geom_bar(fill = "seagreen") +
  geom_label(aes(label = ..count..), stat = "count", hjust = 1.5) +
  coord_flip() +
  theme_bw() +
  labs(x = "", y = "")
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

We can notice high imbalance in each class. Now Let’s proceed for word tokenisation. We will use the workflow of tidytext package as described in Julia Silge's awesome book.

unnest_words <- cleaned %>% 
  unnest_tokens(output = word, input = complaint) %>% 
  count(product, word)

# View Top-10 rows (results will be in a tibble)
unnest_words %>% 
  arrange(desc(n))
## # A tibble: 119,165 × 3
##    product                     word       n
##    <fct>                       <chr>  <int>
##  1 Mortgage                    the   441359
##  2 Credit card or prepaid card the   398873
##  3 Credit card or prepaid card i     368090
##  4 Mortgage                    to    358051
##  5 Mortgage                    i     303823
##  6 Credit card or prepaid card to    303199
##  7 Mortgage                    and   274417
##  8 Credit card or prepaid card and   244352
##  9 Mortgage                    a     190165
## 10 Credit card or prepaid card my    184459
## # ℹ 119,155 more rows

We can see lots of stop words being used. So let’s remove them using tidytext::stop_words inbuilt data.

unnest_words <- unnest_words %>% 
  anti_join(stop_words, by = join_by("word"))

unnest_words %>% 
  arrange(desc(n))
## # A tibble: 116,365 × 3
##    product                     word         n
##    <fct>                       <chr>    <int>
##  1 Credit card or prepaid card card     91716
##  2 Credit card or prepaid card credit   86689
##  3 Mortgage                    mortgage 72708
##  4 Mortgage                    loan     72186
##  5 Credit card or prepaid card account  69050
##  6 Mortgage                    payment  56531
##  7 Credit card or prepaid card payment  34039
##  8 Credit card or prepaid card bank     31989
##  9 Mortgage                    payments 31354
## 10 Student loan                loan     30970
## # ℹ 116,355 more rows

Look’s Much better. Let’s see top-10 words for each product category.

unnest_words %>% 
  group_by(product) %>% 
  slice_max(n, n= 10) %>% 
  ggplot(aes(reorder_within(word, n, product), n, fill = product)) + 
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(. ~ product, scales = "free_y") +
  theme_bw() +
  labs(x = "", y = "")

We can see words like payment, loan, etc. occuring in each product category, so this method of counting frequency is not appropriate one for modeling. We can use Term Frequency-Inverse Document Frequency or TF-IDF as described in above mentioned book. This will give us a metric which will represent how often a word appears in one category while being least frequent in other categories.

unnest_tfidf <- unnest_words %>% 
  bind_tf_idf(term = word,
              document = product, 
              n = n)

unnest_tfidf %>% 
  arrange(-tf_idf) %>% 
  head(10)
## # A tibble: 10 × 6
##    product                     word           n       tf   idf  tf_idf
##    <fct>                       <chr>      <int>    <dbl> <dbl>   <dbl>
##  1 Student loan                navient    13434 0.0133   0.693 0.00925
##  2 Student loan                fedloan     1768 0.00176  1.39  0.00243
##  3 Credit card or prepaid card amex        4402 0.00146  1.39  0.00202
##  4 Student loan                pslf        2306 0.00229  0.693 0.00159
##  5 Student loan                nelnet      2257 0.00224  0.693 0.00155
##  6 Mortgage                    phh         3763 0.00111  1.39  0.00154
##  7 Vehicle loan or lease       gm          1607 0.00211  0.693 0.00146
##  8 Mortgage                    shellpoint  3552 0.00105  1.39  0.00145
##  9 Mortgage                    loancare    3357 0.000988 1.39  0.00137
## 10 Mortgage                    sps         3313 0.000975 1.39  0.00135

Now we can use the same chart, as above to see the word-tokens with highest tf_idf per category.

unnest_tfidf %>% 
  group_by(product) %>% 
  slice_max(tf_idf, n= 10) %>% 
  ggplot(aes(reorder_within(word, tf_idf, product), tf_idf, fill = product)) + 
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(. ~ product, scales = "free_y") +
  theme_bw() +
  labs(x = "", y = "")

Now the data seems more meaningful. We can see some bank names in credit card category, car company names in vehicle loan category, etc.

Model Building

Data Prep

We have already cleaned the data, so let’s load the data

cleaned <- cleaned %>% 
  select(complaint, product)

Out of Sample Testing

Though the data provided has already been split into train and test sets, yet out-of-sample testing is not possible as the class/outcome in test data is not available. Let’s therefore create training and test splits 75:25, first.

set.seed(200776)
initial_split_df <- cleaned %>% 
  initial_split(prop = 0.75, strata = product)

train_df <- training(initial_split_df)
test_df <- testing(initial_split_df)

Model Type Selection

Let’s try the Multinomial Regression with 0.01 penalty and lasso regression thus making mixture = 1.

my_model1 <- multinom_reg(penalty = 0.01, mixture = 1) %>% 
  set_engine("glmnet") %>% 
  set_mode("classification")

Recipe Preparation/Bake

Let’s use the cleaned data for modelling.

my_recipe <- recipe(product ~ complaint, data = train_df) %>% 
  # Tokenize
  step_tokenize(complaint) %>% 
  # remove plurals, etc.
  step_stem(complaint) %>% 
  # remove stop words
  step_stopwords(complaint) %>% 
  # filter token 
  step_tokenfilter(complaint, max_tokens = 500, min_times = 5) %>% 
  # Most IMPORTANT step
  step_tfidf(complaint) %>% 
  # downsampling step
  themis::step_downsample()

my_recipe

Let’s create workflow and fit the Lasso model

Let’s create the workflow with above recipe and model

lasso_workflow <- workflow() %>% 
  add_model(my_model1) %>% 
  add_recipe(my_recipe)

### Evaluate metrics

lasso_workflow %>%
  last_fit(split = initial_split_df) %>% 
  collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy multiclass     0.910 Preprocessor1_Model1
## 2 roc_auc  hand_till      0.981 Preprocessor1_Model1

Trial with decision tree algorithm

Let’s try the same with decision tree algorithm with default parameters

## Model
my_model2 <- decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("classification")
## Update workflow
dt_wf <- workflow() %>% 
  add_recipe(my_recipe) %>% 
  add_model(my_model2)
## Inspect Metrics
dt_wf %>% 
  last_fit(split = initial_split_df) %>% 
  collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy multiclass     0.846 Preprocessor1_Model1
## 2 roc_auc  hand_till      0.935 Preprocessor1_Model1

Trial with random forest

Let’s try the same with random forest algorithm also, before finalising.

rf_model <- rand_forest(mtry = 50, min_n = 25, trees = 45) %>% 
  set_mode("classification") %>% 
  set_engine("ranger")

## New WF
rf_wf <- workflow() %>% 
  add_recipe(my_recipe) %>% 
  add_model(rf_model)

## Inspect metrics
rf_wf %>% 
  last_fit(split = initial_split_df) %>% 
  collect_metrics() 
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy multiclass     0.934 Preprocessor1_Model1
## 2 roc_auc  hand_till      0.989 Preprocessor1_Model1

Finalise Model selection

As both ROC-AUC and accuracy are higher in RF model with randomly chosen parameters mtry, min_n and tress, we should select that model and try to further optimize its performance by tuning some hyper-parameters.

Cross-Fold Validation

Let’s divide the data into 5 folds for cross validation.

set.seed(200776)

my_folds <- cleaned %>%
  vfold_cv(v = 5, strata = product)

Hyper-parameter tuning

Let’s do it with 5 randomly selected values. I have included random seed so that the code can be replicated.

## Tune Lasso
rf_tune <- rand_forest(
  mtry = tune(),
  min_n = tune(),
  trees = 45
) %>% 
  set_mode("classification") %>% 
  set_engine("ranger")

## Update model (tune) in workflow
rf_wf_tune <- rf_wf %>% 
  update_model(rf_tune)


# parallel computing due to size of database and cross-folds

all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

set.seed(2007)

tuned_rf <- tune_grid(
  rf_wf_tune,
  resamples = my_folds,
  grid = 5
)

Let’s see the results and best model

tuned_rf %>% 
  show_best("accuracy")
## # A tibble: 5 × 8
##    mtry min_n .metric  .estimator  mean     n std_err .config             
##   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1    44    36 accuracy multiclass 0.935     5 0.00102 Preprocessor1_Model3
## 2   170    20 accuracy multiclass 0.932     5 0.00117 Preprocessor1_Model1
## 3   289     8 accuracy multiclass 0.930     5 0.00126 Preprocessor1_Model5
## 4   385    32 accuracy multiclass 0.930     5 0.00146 Preprocessor1_Model4
## 5   439    13 accuracy multiclass 0.929     5 0.00131 Preprocessor1_Model2

Visualize performance for different values of penalty i.e. amount of regularisation

tuned_rf %>% 
  autoplot()

See Visuals

tuned_rf %>%
    collect_metrics() %>%
    filter(.metric == "roc_auc") %>%
    select(mean, min_n, mtry) %>%
    pivot_longer(min_n:mtry,
                 values_to = "value",
                 names_to = "parameter") %>%
    ggplot(aes(value, mean, color = parameter)) +
    geom_point(show.legend = FALSE) +
    facet_wrap(~parameter, scales = "free_x")

Finalise model

Finalise workflow

best_rf <- tuned_rf %>% 
  select_best("roc_auc")

wf_final <- rf_wf_tune %>% 
  finalize_workflow(best_rf)

Fit model

registerDoParallel(cl)
fit_final <- wf_final %>% 
  fit(cleaned)

Make Predictions (final goal of the exercise)

test <- read_csv("data/test.csv") %>% 
  clean_names()

test_cleaned <- test %>% 
  # rename column
  rename(complaint = consumer_complaint_narrative) %>% 
  select(complaint) %>% 
  # remove dates
  mutate(complaint = str_remove_all(complaint, "..\\/..\\/....")) %>% 
  # remove all XXX...
  mutate(complaint = str_remove_all(complaint, "X(X+)|x(x+)")) %>% 
  # amount fields
  mutate(complaint = str_remove_all(complaint, "\\{\\$[0-9\\.]*\\}")) %>% 
  # digits, etc.
  mutate(complaint = str_remove_all(complaint, "[0-9]+")) %>% 
  # sentence breaks
  mutate(complaint = str_remove_all(complaint, "\n_*\n")) %>% 
  # punctuation
  mutate(complaint = str_remove_all(complaint, "(_+\\s?)+"))

preds <- fit_final %>% 
  predict(test_cleaned) %>% 
  bind_cols(test)

View all Predictions

preds[, 1]
## # A tibble: 20 × 1
##    .pred_class                
##    <fct>                      
##  1 Student loan               
##  2 Vehicle loan or lease      
##  3 Student loan               
##  4 Mortgage                   
##  5 Vehicle loan or lease      
##  6 Credit card or prepaid card
##  7 Credit card or prepaid card
##  8 Credit card or prepaid card
##  9 Student loan               
## 10 Vehicle loan or lease      
## 11 Student loan               
## 12 Mortgage                   
## 13 Vehicle loan or lease      
## 14 Credit card or prepaid card
## 15 Mortgage                   
## 16 Credit card or prepaid card
## 17 Mortgage                   
## 18 Student loan               
## 19 Vehicle loan or lease      
## 20 Mortgage