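Load libraries

library(tidyverse)  # dplyr (%>%, glimpse), ggplot2, purrr (map), stringr (str_to_lower)
library(tm)         # DocumentTermMatrix, inspect, findFreqTerms, removeSparseTerms
library(tidytext)   # get_stopwords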

Training data – complaints from CFPB (data_complaints_train.csv)

train <- read.csv("data_complaints_train.csv")

Testing data – complaints from CFPB

test <- read.csv("data_complaints_test.csv")

The goal of your project is to predict the consumer complaint category, which is the “Product” variable in the training data set. You may use any of the other variables as predictors. You should create a report describing how you built your model, how you used cross-validation, what you expect the out-of-sample error to be, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
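
One common way to estimate out-of-sample error is k-fold cross-validation. The following is a minimal sketch under stated assumptions, not the model built in this report: the labels are simulated, the names `labels`, `fold` and `cv_error` are made up for illustration, and a majority-class guess stands in for a real classifier.

```r
set.seed(42)

# Simulated (hypothetical) labels standing in for train$Product
labels <- factor(sample(c("Mortgage", "Student loan"), 100,
                        replace = TRUE, prob = c(0.7, 0.3)))

k <- 5
# Assign every observation to one of k folds of equal size, in random order
fold <- sample(rep(seq_len(k), length.out = length(labels)))

# For each fold: "fit" on the other folds, then measure error on the held-out fold
errors <- sapply(seq_len(k), function(i) {
  train_y <- labels[fold != i]
  test_y  <- labels[fold == i]
  majority <- names(which.max(table(train_y)))  # baseline "model"
  mean(test_y != majority)                      # misclassification rate
})

# The average held-out error is the cross-validated estimate
cv_error <- mean(errors)
```

Replacing the baseline with a real model keeps the same structure: fit on `fold != i`, predict on `fold == i`, and average the held-out error.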

Learn about dataset, perform cleaning

Have a look at the training dataset’s summary:

summary(train)
##    Product          Consumer.complaint.narrative   Company         
##  Length:90975       Length:90975                 Length:90975      
##  Class :character   Class :character             Class :character  
##  Mode  :character   Mode  :character             Mode  :character  
##     State             ZIP.code         Submitted.via     
##  Length:90975       Length:90975       Length:90975      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
train$Product <- factor(train$Product)
levels(train$Product)
## [1] "Credit card or prepaid card" "Mortgage"                   
## [3] "Student loan"                "Vehicle loan or lease"
train$State <- factor(train$State)
levels(train$State)
##  [1] "AA"                                  
##  [2] "AE"                                  
##  [3] "AK"                                  
##  [4] "AL"                                  
##  [5] "AP"                                  
##  [6] "AR"                                  
##  [7] "AS"                                  
##  [8] "AZ"                                  
##  [9] "CA"                                  
## [10] "CO"                                  
## [11] "CT"                                  
## [12] "DC"                                  
## [13] "DE"                                  
## [14] "FL"                                  
## [15] "FM"                                  
## [16] "GA"                                  
## [17] "GU"                                  
## [18] "HI"                                  
## [19] "IA"                                  
## [20] "ID"                                  
## [21] "IL"                                  
## [22] "IN"                                  
## [23] "KS"                                  
## [24] "KY"                                  
## [25] "LA"                                  
## [26] "MA"                                  
## [27] "MD"                                  
## [28] "ME"                                  
## [29] "MH"                                  
## [30] "MI"                                  
## [31] "MN"                                  
## [32] "MO"                                  
## [33] "MS"                                  
## [34] "MT"                                  
## [35] "NC"                                  
## [36] "ND"                                  
## [37] "NE"                                  
## [38] "NH"                                  
## [39] "NJ"                                  
## [40] "NM"                                  
## [41] "None"                                
## [42] "NV"                                  
## [43] "NY"                                  
## [44] "OH"                                  
## [45] "OK"                                  
## [46] "OR"                                  
## [47] "PA"                                  
## [48] "PR"                                  
## [49] "RI"                                  
## [50] "SC"                                  
## [51] "SD"                                  
## [52] "TN"                                  
## [53] "TX"                                  
## [54] "UNITED STATES MINOR OUTLYING ISLANDS"
## [55] "UT"                                  
## [56] "VA"                                  
## [57] "VI"                                  
## [58] "VT"                                  
## [59] "WA"                                  
## [60] "WI"                                  
## [61] "WV"                                  
## [62] "WY"
train %>% glimpse()
## Rows: 90,975
## Columns: 6
## $ Product                      <fct> Credit card or prepaid card, Mortgage, St…
## $ Consumer.complaint.narrative <chr> "I initially in writing to Chase Bank in …
## $ Company                      <chr> "JPMORGAN CHASE & CO.", "Ditech Financial…
## $ State                        <fct> CT, GA, IN, MI, MI, FL, WA, CA, CA, VA, G…
## $ ZIP.code                     <chr> "064XX", "None", "463XX", "490XX", "480XX…
## $ Submitted.via                <chr> "Web", "Web", "Web", "Web", "Web", "Web",…

Create a bar plot of the product category to see how many categories there are and how many complaints fall into each.

ggplot(data = train, aes(x = Product, fill = Product)) +
  geom_bar() +
  labs(x = "Product category", y = "Number of occurrences", fill = "Product category")

Calculate how many NAs there are in each variable.

train %>% 
  map(is.na) %>%
  map(sum)
## $Product
## [1] 0
## 
## $Consumer.complaint.narrative
## [1] 0
## 
## $Company
## [1] 0
## 
## $State
## [1] 0
## 
## $ZIP.code
## [1] 0
## 
## $Submitted.via
## [1] 0

There are no NAs, although State and ZIP.code use the string "None" as a placeholder for missing values.
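
Placeholder strings such as "None" (visible in the State levels and ZIP.code values above) are not counted by is.na(). A minimal sketch of counting them and, optionally, recoding them as NA; the toy data frame `toy` and the result names are hypothetical and only mimic two columns of the training data:

```r
# Hypothetical toy data mimicking the State and ZIP.code columns
toy <- data.frame(
  State    = c("CT", "None", "MI"),
  ZIP.code = c("064XX", "None", "None"),
  stringsAsFactors = FALSE
)

# Count the "None" placeholders per column
none_counts <- colSums(toy == "None")

# Optionally recode the placeholder as a real missing value
toy[toy == "None"] <- NA
na_counts <- colSums(is.na(toy))
```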

Register a parallel backend on 75% of the available cores to speed up later computations.

doParallel::registerDoParallel(cores = floor(0.75 * parallel::detectCores()))

In the Consumer.complaint.narrative variable of the training dataset, strings like XXXX are used to redact some data. We’ll get rid of them, remove numbers and punctuation, and convert all text to lower case.

# drop redaction markers such as XX/XX/XXXX and XXXX
train$Consumer.complaint.narrative <- gsub("XX/", "", train$Consumer.complaint.narrative)
train$Consumer.complaint.narrative <- gsub("X/", "", train$Consumer.complaint.narrative)
# note: "[XX]" is a character class equivalent to "X", so every remaining capital X
# is removed, including those inside ordinary upper-case words
train$Consumer.complaint.narrative <- gsub("[XX]", "", train$Consumer.complaint.narrative)
# drop punctuation and digits, then convert to lower case
train$Consumer.complaint.narrative <- gsub("[[:punct:]]", "", train$Consumer.complaint.narrative)
train$Consumer.complaint.narrative <- gsub("[0-9]", "", train$Consumer.complaint.narrative)
train$Consumer.complaint.narrative <- str_to_lower(train$Consumer.complaint.narrative)
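
Because "[XX]" matches any single capital X, upper-case words lose letters too, which is probably why tokens such as "taed" show up in the output below. A possible alternative, shown only as a sketch, removes runs of two or more X and replaces punctuation with spaces so that neighbouring words are not glued together. The helper name `clean_narrative` is hypothetical and is not used elsewhere in this report.

```r
# Hypothetical helper: a stricter variant of the cleaning steps above
clean_narrative <- function(x) {
  x <- gsub("X{2,}(/X{2,})*", " ", x)   # redaction runs: XXXX, XX/XX/XXXX
  x <- gsub("[[:punct:]]", " ", x)      # punctuation becomes a space
  x <- gsub("[0-9]", " ", x)            # digits become a space
  x <- tolower(x)
  x <- gsub("[[:space:]]+", " ", x)     # collapse repeated whitespace
  trimws(x)
}

cleaned <- clean_narrative("I was TAXED on XX/XX/XXXX by XXXX Bank, 500.00!")
# "TAXED" keeps its X because a single X is not treated as a redaction marker
```

Applying it to the real narratives would change the term counts reported below, so the numbers in this report correspond to the original cleaning steps only.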

head(train)
##                       Product
## 1 Credit card or prepaid card
## 2                    Mortgage
## 3                Student loan
## 4 Credit card or prepaid card
## 5 Credit card or prepaid card
## 6 Credit card or prepaid card
##   Consumer.complaint.narrative
## 1 i initially in writing to chase bank in late  about a charge that was unauthorized on  from  in the amount of  after many letters and complaints they still did not do anything yet in  they closed my original account and reissued a new account and sent me a letter indicating they moved this charge to fraud and it would be immediately credited however they have failed to do that and almost  months later it is still on my statement and im getting charged interest even after this i still use my card and pay for my charges and have never been late however i will not pay for a charge i didnt make and refuse and chase refuses to do anything even after many attempts and letters they write stating so i feel this is last step before i begin a case in     against them
## 2 my ex husband and myself had a mobile home  home mortgage  with   back in  i went through a horrible divorce as my husband was using drugs and failing to pay the bills which led to my having to divorce him the mortgage was in both of our names we only had the mortgage for one year and they got the trailer bc of being unable to pay debt i was not working at the time as i had just given birth to my first child in   she is now  of age  as this happened years ago \n\nnow after  years      filled a   to the irs and the irs alerted me which caused me an adverse affect on my irs taxes \ni received a l letter from the irs  which stated that due to ditech mortgage sending a  with taxable income of  this could cause me to have to owe the irs  irs told me to contact ditech financial to dispute this and request that ditech mortgage rescind the  in order  so it is not   taxable income  \n\ni contacted ditech at  spoke with a    and i told her that i did not want to be recorded but she did it anyways \ni told her that i never settled on an amount of    of a debt from a mortgage that i had back in  with ditech nor did i ever receive a  from them i asked them to rescind the   taxable income of   as this is a debt from   years ago  in which i should not be taxed for taxable income of   on a debt from  years ago however  refused to rescind  reverse  the  \nnote i have suffered with my credit from this debt for over ten years and it was removed from my debt bc it has been over the  to  year period that the debt is lawfully removed from my credit they cant go back  years later and in  send the irs and myself a  form to tax me  for a debt from over  years ago that was charged off my credit report back in the year  this is very unfair and not allowed \nalso i disputing the fact that they kept putting this on and off my credit for over  years now and it has negatively affected my credit for way over the  year period \npoint being if they had plans of taing me at  then they should have taxed me  yrs ago  before the  to  year period passed which is now way over  if they were going to send a  taxable income they should have done this years ago not in  i am requesting a complaint submitted as they have unfairly affected my credit past the  year period and it is violating my rights to be taed for something in  on a debt from the year 
## 3 i was a student at   from  i accumulated  of debt in student loans this school flat out lied to me i was under the assumption it would have been less than half of that since i had a lot of credits that transferred not to mention this school promised me job placement which they did not they told me i would be making a lot more than what i do they gave me false information the only thing this school did for me is put me in tons of debt and lied to me i am a single mother of   and was so excited to get my degree until i graduated and realize how much they lied to me the school is now shut down can i get these loans forgiven due to being lied to and given false information that basically pressured me into enrolling if i would have known that my student loans would have been this high i would have just stayed at the  college where i did my prerequisites i feel like i fool that  had done this to me
## 4 it has come to my attention the citi group is actively attempting to interfere with rights guaranteed by the constitution  nd amendment  by manipulating or denying service to certain entities or transactions that are protected by the constitution i do not expect citi group or any other financial institution to be controlling the nation s activities through such unpatriotic and unamerican behavior i do expect citi group to tend to the business of their business and stay out of the business of social tyranny
## 5 this banks new firearm policies run counter to laws and regulations passed by congress and they infringe and discriminate against an individuals second amendment rights such policies should not be endorsed by our federal government which instead should do business with companies that respect all of our constitutional rights including the second amendment our federal government should take all necessary steps to review and terminate its contract with citibank unless they rescind their guidelines
## 6 i only use my walmart store card to keep from having it cancelled my sons a college student had his card cancelled for non use i made a small purchase in  and paid it off made another small purchase in  and thought i paid it off online via wlamartcom but that seemed not to work i didnt receive a statment after that probably the mail courieri have an apartment style mailbox system where i live nevertheless so i thought all was good until credit sesame showed a past due reporti was out of town at the time so i immediately searched walmarts payment site and paid the account in full i also noticed that my  account had been reduced to a mere  just  short of my now due payment which included aprox  in past due fees so it appears that i maxed out the account by a    purchase my credit dropped  score points from  down to  from excellent to a fair rating
##                   Company State ZIP.code Submitted.via
## 1    JPMORGAN CHASE & CO.    CT    064XX           Web
## 2    Ditech Financial LLC    GA     None           Web
## 3 Navient Solutions, LLC.    IN    463XX           Web
## 4          CITIBANK, N.A.    MI    490XX           Web
## 5          CITIBANK, N.A.    MI    480XX           Web
## 6     SYNCHRONY FINANCIAL    FL    331XX           Web

Create a document-term matrix:

matrix <- DocumentTermMatrix(train$Consumer.complaint.narrative)
inspect(matrix)
## <<DocumentTermMatrix (documents: 90975, terms: 73921)>>
## Non-/sparse entries: 9149065/6715813910
## Sparsity           : 100%
## Maximal term length: 1963
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs    and for have not that the they this was with
##   11002 104  35   41  69   54 396   61   51  48   38
##   11343 113  70   35  42  239 215   33   36 153   52
##   19650 136  73   20  23   59 254   22    6  16   47
##   25455 183  72   50  18   43 251    0   38  27   19
##   27671 134  53    5  45   55 308    3   43   8   34
##   60769 144  57   33  19  149 269   22   28  61   65
##   64243 191  50   31  59   21 197   39  107 106   50
##   68621 199  93   20  46   54 242   40   45  22   70
##   70389 162  41   34  46   85 188   49   76  29   21
##   73540 159  53   16  10   42 202   17   19   7   21
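
To make the structure of such a matrix concrete, here is a minimal base-R sketch on three hypothetical toy documents: one row per document, one column per term, and each cell holding a term count. It only illustrates the idea and is not how the tm package stores the matrix internally; the names `docs`, `vocab` and `dtm` are made up for this example.

```r
# Three hypothetical, already cleaned documents
docs <- c("loan payment late", "card payment card", "mortgage loan")

# Tokenise on single spaces and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document (rows: documents, columns: terms)
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

term_totals <- colSums(dtm)      # total occurrences of each term
sparsity    <- mean(dtm == 0)    # share of empty cells, as in "Sparsity" above
```

With tens of thousands of terms almost every cell is zero, which is why the real matrix reports a sparsity of about 100%.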

Find the most frequent terms in the matrix, that is, terms that occur at least 10,000 times across all narratives:

findFreqTerms(matrix, lowfreq = 10000)
##   [1] "about"          "account"        "after"          "amount"        
##   [5] "and"            "anything"       "bank"           "been"          
##   [9] "before"         "card"           "case"           "charge"        
##  [13] "charged"        "charges"        "chase"          "closed"        
##  [17] "did"            "didnt"          "even"           "for"           
##  [21] "fraud"          "from"           "getting"        "have"          
##  [25] "however"        "immediately"    "interest"       "last"          
##  [29] "late"           "later"          "letter"         "make"          
##  [33] "many"           "months"         "never"          "new"           
##  [37] "not"            "original"       "pay"            "sent"          
##  [41] "statement"      "stating"        "still"          "that"          
##  [45] "the"            "them"           "they"           "this"          
##  [49] "use"            "was"            "will"           "would"         
##  [53] "yet"            "also"           "asked"          "back"          
##  [57] "being"          "both"           "but"            "complaint"     
##  [61] "contact"        "contacted"      "could"          "credit"        
##  [65] "debt"           "dispute"        "done"           "due"           
##  [69] "financial"      "first"          "given"          "going"         
##  [73] "got"            "had"            "has"            "her"           
##  [77] "him"            "home"           "income"         "just"          
##  [81] "mortgage"       "now"            "off"            "one"           
##  [85] "only"           "order"          "our"            "over"          
##  [89] "past"           "point"          "receive"        "received"      
##  [93] "report"         "request"        "send"           "she"           
##  [97] "should"         "spoke"          "stated"         "submitted"     
## [101] "then"           "through"        "time"           "told"          
## [105] "very"           "want"           "way"            "went"          
## [109] "were"           "which"          "with"           "year"          
## [113] "years"          "can"            "down"           "get"           
## [117] "how"            "information"    "into"           "like"          
## [121] "loans"          "making"         "more"           "out"           
## [125] "put"            "since"          "student"        "than"          
## [129] "these"          "under"          "until"          "what"          
## [133] "where"          "any"            "are"            "business"      
## [137] "citi"           "other"          "service"        "their"         
## [141] "all"            "its"            "take"           "another"       
## [145] "fees"           "full"           "his"            "keep"          
## [149] "made"           "mail"           "online"         "paid"          
## [153] "payment"        "purchase"       "score"          "via"           
## [157] "work"           "applied"        "call"           "called"        
## [161] "check"          "fee"            "help"           "need"          
## [165] "problem"        "provided"       "representative" "response"      
## [169] "calls"          "company"        "days"           "end"           
## [173] "payments"       "phone"          "loan"           "paying"        
## [177] "regarding"      "see"            "statements"     "there"         
## [181] "advised"        "balance"        "find"           "money"         
## [185] "monthly"        "once"           "vehicle"        "when"          
## [189] "who"            "why"            "you"            "department"    
## [193] "documents"      "dont"           "every"          "issue"         
## [197] "modification"   "month"          "number"         "please"        
## [201] "process"        "funds"          "give"           "know"          
## [205] "provide"        "times"          "your"           "believe"       
## [209] "next"           "program"        "several"        "address"       
## [213] "again"          "because"        "claim"          "filed"         
## [217] "informed"       "someone"        "used"           "during"        
## [221] "found"          "without"        "rate"           "refund"        
## [225] "requested"      "same"           "today"          "total"         
## [229] "tried"          "additional"     "day"            "close"         
## [233] "bill"           "car"            "reason"         "said"          
## [237] "date"           "fraudulent"     "insurance"      "property"      
## [241] "servicing"      "already"        "correct"        "current"       
## [245] "fargo"          "foreclosure"    "wells"          "while"         
## [249] "house"          "lender"         "name"           "application"   
## [253] "navient"        "needed"         "plan"           "situation"     
## [257] "customer"       "attached"       "denied"         "email"         
## [261] "well"           "closing"        "escrow"         "took"          
## [265] "ive"            "does"           "some"           "capital"       
## [269] "file"           "two"            "different"      "accounts"      
## [273] "nothing"        "cards"          "able"           "notice"        
## [277] "explained"      "offer"          "supervisor"     "trying"        
## [281] "within"         "each"           "person"         "sale"

Remove words like “and”, “a”, “the”, “I”, “my”, etc. They are among the most frequent words, but they won’t help to predict the product.

stopwords <- get_stopwords()
head(stopwords)
## # A tibble: 6 × 2
##   word   lexicon 
##   <chr>  <chr>   
## 1 i      snowball
## 2 me     snowball
## 3 my     snowball
## 4 myself snowball
## 5 we     snowball
## 6 our    snowball
# regular expression: \\bword\\b matches "word" only as a whole word (\\b is a word boundary)
train$Consumer.complaint.narrative <- gsub(paste0('\\b', stopwords$word, '\\b', collapse = '|'), '',
                                           train$Consumer.complaint.narrative)
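
The effect of the word-boundary pattern is easier to see on a small example. This is a sketch with a hypothetical four-word stop list (`toy_stopwords`) rather than the full snowball lexicon:

```r
# Hypothetical short stop list and sentence
toy_stopwords <- c("i", "my", "the", "and")
text <- "i paid the bill and my bank charged interest"

# One alternation of whole-word patterns: \\bi\\b|\\bmy\\b|\\bthe\\b|\\band\\b
pattern <- paste0("\\b", toy_stopwords, "\\b", collapse = "|")

removed <- gsub(pattern, "", text)
# Removal leaves gaps, so collapse repeated spaces and trim the ends
removed <- trimws(gsub("[[:space:]]+", " ", removed))
```

The boundaries keep the "i" inside "paid", "bill" and "interest" intact. Note that stop words containing an apostrophe, such as "didn't", no longer match text from which punctuation has already been stripped ("didnt").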

Create a new document-term matrix:

matrix2 <- DocumentTermMatrix(train$Consumer.complaint.narrative)
inspect(matrix2)
## <<DocumentTermMatrix (documents: 90975, terms: 73819)>>
## Non-/sparse entries: 7015453/6708668072
## Sparsity           : 100%
## Maximal term length: 1963
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs    account bank card credit loan mortgage payment payments received told
##   19650       4   34    0      3   49       10       6        7       12    0
##   25455      79    2    0     11   81       58      29        4        3    0
##   27671       2    1    0      1    4        4       8        5        1    0
##   58908      23    2    0     23   27        0      22       32        3    0
##   64243       0    0    0      0   26       15       1       10        4    2
##   68621       9    0    0     10   56        0      16        9       10    8
##   70389       1    0    0      3    2       26       0        3        0    0
##   73540       8    1    0      6   33       61       7       37        3    0
##   77518       0    0    0      0  837        0      45        1        0    2
##   84551      38    0    0      8   70       60      22       16        0    0

Remove sparse terms (those missing from at least 95% of the complaints) to reduce the matrix size:

matrix2 <- removeSparseTerms(matrix2, 0.95)
inspect(matrix2)
## <<DocumentTermMatrix (documents: 90975, terms: 340)>>
## Non-/sparse entries: 3524175/27407325
## Sparsity           : 89%
## Maximal term length: 14
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs    account bank card credit loan mortgage payment payments received told
##   11002       1    5    0      1   68       42       4        4       10    1
##   11343       4    0    1      9   30        6       7        1        2   57
##   25455      79    2    0     11   81       58      29        4        3    0
##   37990      12    6    0     11   41        6       6        4       29    9
##   56802      18    7    0      0    6       20      17       17       21    8
##   60769       0    0    6      5   43        6       7        4       27   18
##   64243       0    0    0      0   26       15       1       10        4    2
##   68621       9    0    0     10   56        0      16        9       10    8
##   70174       8    1    0      7   41        5      59       32        5    9
##   77518       0    0    0      0  837        0      45        1        0    2
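
The effect of the `0.95` threshold can be sketched in base R. As far as I understand tm's documented behaviour, `removeSparseTerms(x, sparse)` keeps a term only if it occurs in more than `(1 - sparse)` of the documents, so `0.95` keeps terms found in roughly 5% of complaints or more. The toy matrix and the `0.75` threshold below are made up for illustration.

```r
# Toy document-term counts: 8 documents, 3 terms (values are invented)
counts <- cbind(
  loan   = c(2, 1, 0, 3, 1, 0, 4, 0),  # present in 5 documents
  escrow = c(0, 1, 0, 0, 2, 0, 1, 0),  # present in 3 documents
  zebra  = c(0, 0, 0, 0, 0, 1, 0, 0)   # present in 1 document
)

sparse <- 0.75  # illustrative threshold; the analysis above uses 0.95

# Number of documents in which each term occurs at least once
doc_freq <- colSums(counts > 0)

# Keep terms occurring in more than (1 - sparse) of the documents
keep <- doc_freq > nrow(counts) * (1 - sparse)
kept <- names(doc_freq)[keep]
kept

# Same rule applied to the sizes in this report:
# a term must appear in more than this many of the 90975 complaints
min_docs <- 90975 * (1 - 0.95)
min_docs
```

This is why the term count drops from 73819 to 340: the vast majority of distinct tokens appear in only a handful of complaints.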

Convert matrix to dataframe:

matrix_df <- matrix2 %>%
  as.matrix() %>%
  as.data.frame() 

Find the most frequent words in the text after stop-word removal; these words will be used for prediction:

freq_terms <- data.frame(word = findFreqTerms(matrix2, lowfreq = 5000))
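
Note that `lowfreq` refers to the total number of occurrences of a term across all documents, not to the number of documents containing it. A small base R sketch of that rule, on an invented matrix and with an illustrative cut-off of 5 instead of 5000:

```r
# Toy document-term counts (values are invented)
counts <- matrix(
  c(3, 0, 1,
    2, 5, 0,
    0, 1, 0,
    4, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("loan", "card", "escrow"))
)

lowfreq <- 5  # illustrative; the analysis above uses 5000

# Total occurrences of each term over all documents
term_totals <- colSums(counts)

# Terms whose total count reaches the cut-off
frequent <- names(term_totals)[term_totals >= lowfreq]
frequent
```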

Some uninformative words remain in the text; remove them as well, since they will not help prediction:

custom_dict <- data.frame(word = c("almost", "anything", "didnt", "even", "however", "never", "still", "use","yet", "also", "cant", "complaint", "done", "due", "fact", "just", "something", "unable", "want", "went", "can", "get", "much", "since", "including", "another", "good", "via", "ask", "call", "help", "need", "problem", "see", "wanted", "around", "find", "assistance", "dont", "every", "may", "issue", "please", "know", "next", "either", "everything", "people", "without", "today", "upon", "additional", "date", "following", "name", "try", "recently", "always", "well", "must", "sure", "per", "able", "finally", "wasnt", "prior", "within", "right", "left", "person"))

train2 <- train
train2$Consumer.complaint.narrative <- gsub(paste0('\\b', custom_dict$word, '\\b', collapse = '|'), '', 
                                            train2$Consumer.complaint.narrative )

head(train2)
##                       Product
## 1 Credit card or prepaid card
## 2                    Mortgage
## 3                Student loan
## 4 Credit card or prepaid card
## 5 Credit card or prepaid card
## 6 Credit card or prepaid card
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Consumer.complaint.narrative
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        initially  writing  chase bank  late    charge   unauthorized       amount    many letters  complaints           closed  original account  reissued  new account  sent   letter indicating  moved  charge  fraud     immediately credited    failed       months later      statement  im getting charged interest        card  pay   charges     late     pay   charge   make  refuse  chase refuses      many attempts  letters  write stating   feel   last step   begin  case       
## 2  ex husband     mobile home  home mortgage     back       horrible divorce   husband  using drugs  failing  pay  bills  led     divorce   mortgage      names     mortgage  one year   got  trailer bc     pay debt    working   time     given birth   first child      now   age    happened years ago \n\nnow   years      filled      irs   irs alerted   caused   adverse affect   irs taxes \n received  l letter   irs   stated    ditech mortgage sending    taxable income     cause     owe  irs  irs told   contact ditech financial  dispute   request  ditech mortgage rescind    order        taxable income  \n\n contacted ditech   spoke        told         recorded     anyways \n told     settled   amount       debt   mortgage    back    ditech    ever receive      asked   rescind    taxable income        debt    years ago        taxed  taxable income      debt   years ago   refused  rescind  reverse    \nnote   suffered   credit   debt   ten years    removed   debt bc         year period   debt  lawfully removed   credit   go back  years later    send  irs     form  tax     debt    years ago   charged   credit report back   year     unfair   allowed \n  disputing     kept putting      credit    years now    negatively affected  credit  way    year period \npoint     plans  taing        taxed   yrs ago       year period passed   now way      going  send   taxable income      years ago      requesting   submitted    unfairly affected  credit past   year period    violating  rights   taed       debt   year 
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           student       accumulated   debt  student loans  school flat  lied       assumption     less  half       lot  credits  transferred   mention  school promised  job placement      told     making  lot       gave  false information   thing  school     put   tons  debt  lied      single mother       excited    degree   graduated  realize    lied    school  now shut      loans forgiven    lied   given false information  basically pressured   enrolling     known   student loans     high     stayed    college     prerequisites  feel like  fool       
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   come   attention  citi group  actively attempting  interfere  rights guaranteed   constitution  nd amendment   manipulating  denying service  certain entities  transactions   protected   constitution    expect citi group    financial institution   controlling  nation s activities   unpatriotic  unamerican behavior   expect citi group  tend   business   business  stay    business  social tyranny
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          banks new firearm policies run counter  laws  regulations passed  congress   infringe  discriminate   individuals second amendment rights  policies    endorsed   federal government  instead   business  companies  respect    constitutional rights   second amendment  federal government  take  necessary steps  review  terminate  contract  citibank unless  rescind  guidelines
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           walmart store card  keep    cancelled  sons  college student   card cancelled  non   made  small purchase    paid   made  small purchase    thought  paid   online  wlamartcom   seemed   work   receive  statment   probably  mail courieri   apartment style mailbox system   live nevertheless   thought     credit sesame showed  past  reporti    town   time   immediately searched walmarts payment site  paid  account  full   noticed    account   reduced   mere    short   now  payment  included aprox   past  fees   appears   maxed   account      purchase  credit dropped  score points       excellent   fair rating
##                   Company State ZIP.code Submitted.via
## 1    JPMORGAN CHASE & CO.    CT    064XX           Web
## 2    Ditech Financial LLC    GA     None           Web
## 3 Navient Solutions, LLC.    IN    463XX           Web
## 4          CITIBANK, N.A.    MI    490XX           Web
## 5          CITIBANK, N.A.    MI    480XX           Web
## 6     SYNCHRONY FINANCIAL    FL    331XX           Web

Create new term matrix:

matrix3 <- DocumentTermMatrix(train2$Consumer.complaint.narrative)
inspect(matrix3)
## <<DocumentTermMatrix (documents: 90975, terms: 73748)>>
## Non-/sparse entries: 6244827/6702979473
## Sparsity           : 100%
## Maximal term length: 1963
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs    account bank card credit loan mortgage payment payments received told
##   19650       4   34    0      3   49       10       6        7       12    0
##   25455      79    2    0     11   81       58      29        4        3    0
##   27671       2    1    0      1    4        4       8        5        1    0
##   33614       2   32    2      6    1       12       1        0        3    0
##   58908      23    2    0     23   27        0      22       32        3    0
##   64243       0    0    0      0   26       15       1       10        4    2
##   68621       9    0    0     10   56        0      16        9       10    8
##   73540       8    1    0      6   33       61       7       37        3    0
##   77518       0    0    0      0  837        0      45        1        0    2
##   84551      38    0    0      8   70       60      22       16        0    0

Remove infrequent words, to reduce matrix size:

matrix3 <- removeSparseTerms(matrix3, 0.95)
inspect(matrix3)
## <<DocumentTermMatrix (documents: 90975, terms: 270)>>
## Non-/sparse entries: 2753565/21809685
## Sparsity           : 89%
## Maximal term length: 14
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs    account bank card credit loan mortgage payment payments received told
##   11002       1    5    0      1   68       42       4        4       10    1
##   15879       2   73    0      0   19       12       2        6       38    8
##   19650       4   34    0      3   49       10       6        7       12    0
##   25455      79    2    0     11   81       58      29        4        3    0
##   37990      12    6    0     11   41        6       6        4       29    9
##   64243       0    0    0      0   26       15       1       10        4    2
##   68621       9    0    0     10   56        0      16        9       10    8
##   70174       8    1    0      7   41        5      59       32        5    9
##   77518       0    0    0      0  837        0      45        1        0    2
##   84551      38    0    0      8   70       60      22       16        0    0

Convert matrix to dataframe:

matrix_df <- matrix3 %>%
  as.matrix() %>%
  as.data.frame() 

Find the most frequent words in text without “stopwords”:

freq_terms <- data.frame(word = findFreqTerms(matrix3, lowfreq = 5000))

Merge with train2 dataset, to get the Product column back:

# matrix_df is already a data frame, so no further conversion is needed
matrix_df <- matrix_df %>%
  bind_cols(Product = train2$Product) %>%
  select(Product, everything())

Get a snapshot of the matrix, and some information:

skim(matrix_df)
Data summary
Name matrix_df
Number of rows 90975
Number of columns 271
_______________________
Column type frequency:
factor 1
numeric 270
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Product 0 1 FALSE 4 Cre: 38294, Mor: 30957, Stu: 12485, Veh: 9239

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
account 0 1 1.30 2.56 0 0 0 2 165 ▇▁▁▁▁
amount 0 1 0.49 1.30 0 0 0 1 98 ▇▁▁▁▁
bank 0 1 0.69 2.02 0 0 0 1 82 ▇▁▁▁▁
card 0 1 1.03 2.42 0 0 0 1 77 ▇▁▁▁▁
case 0 1 0.15 0.67 0 0 0 0 36 ▇▁▁▁▁
charge 0 1 0.25 0.96 0 0 0 0 61 ▇▁▁▁▁
charged 0 1 0.18 0.62 0 0 0 0 30 ▇▁▁▁▁
charges 0 1 0.24 1.05 0 0 0 0 51 ▇▁▁▁▁
chase 0 1 0.24 1.56 0 0 0 0 59 ▇▁▁▁▁
closed 0 1 0.21 0.71 0 0 0 0 25 ▇▁▁▁▁
feel 0 1 0.10 0.37 0 0 0 0 15 ▇▁▁▁▁
fraud 0 1 0.18 0.91 0 0 0 0 38 ▇▁▁▁▁
getting 0 1 0.11 0.38 0 0 0 0 11 ▇▁▁▁▁
immediately 0 1 0.12 0.42 0 0 0 0 16 ▇▁▁▁▁
interest 0 1 0.46 1.44 0 0 0 0 70 ▇▁▁▁▁
last 0 1 0.19 0.53 0 0 0 0 14 ▇▁▁▁▁
late 0 1 0.43 1.20 0 0 0 0 37 ▇▁▁▁▁
later 0 1 0.15 0.46 0 0 0 0 12 ▇▁▁▁▁
letter 0 1 0.44 1.25 0 0 0 0 44 ▇▁▁▁▁
make 0 1 0.37 0.90 0 0 0 0 24 ▇▁▁▁▁
many 0 1 0.12 0.45 0 0 0 0 12 ▇▁▁▁▁
months 0 1 0.33 0.82 0 0 0 0 19 ▇▁▁▁▁
new 0 1 0.32 0.94 0 0 0 0 32 ▇▁▁▁▁
original 0 1 0.11 0.49 0 0 0 0 22 ▇▁▁▁▁
pay 0 1 0.57 1.20 0 0 0 1 28 ▇▁▁▁▁
sent 0 1 0.47 1.11 0 0 0 1 53 ▇▁▁▁▁
statement 0 1 0.26 0.93 0 0 0 0 30 ▇▁▁▁▁
stating 0 1 0.16 0.53 0 0 0 0 27 ▇▁▁▁▁
ago 0 1 0.08 0.33 0 0 0 0 12 ▇▁▁▁▁
asked 0 1 0.36 0.93 0 0 0 0 30 ▇▁▁▁▁
back 0 1 0.56 1.23 0 0 0 1 28 ▇▁▁▁▁
contact 0 1 0.20 0.88 0 0 0 0 187 ▇▁▁▁▁
contacted 0 1 0.24 0.65 0 0 0 0 23 ▇▁▁▁▁
credit 0 1 1.36 2.50 0 0 0 2 100 ▇▁▁▁▁
debt 0 1 0.14 0.80 0 0 0 0 48 ▇▁▁▁▁
dispute 0 1 0.20 1.04 0 0 0 0 75 ▇▁▁▁▁
ever 0 1 0.08 0.33 0 0 0 0 18 ▇▁▁▁▁
financial 0 1 0.19 0.78 0 0 0 0 31 ▇▁▁▁▁
first 0 1 0.25 0.70 0 0 0 0 21 ▇▁▁▁▁
form 0 1 0.08 0.48 0 0 0 0 32 ▇▁▁▁▁
given 0 1 0.12 0.44 0 0 0 0 21 ▇▁▁▁▁
going 0 1 0.19 0.60 0 0 0 0 42 ▇▁▁▁▁
got 0 1 0.19 0.60 0 0 0 0 21 ▇▁▁▁▁
happened 0 1 0.08 0.33 0 0 0 0 10 ▇▁▁▁▁
home 0 1 0.40 1.36 0 0 0 0 48 ▇▁▁▁▁
income 0 1 0.13 0.74 0 0 0 0 56 ▇▁▁▁▁
mortgage 0 1 0.81 2.36 0 0 0 1 130 ▇▁▁▁▁
note 0 1 0.09 0.70 0 0 0 0 64 ▇▁▁▁▁
now 0 1 0.44 0.88 0 0 0 1 21 ▇▁▁▁▁
one 0 1 0.58 1.42 0 0 0 1 91 ▇▁▁▁▁
order 0 1 0.12 0.59 0 0 0 0 57 ▇▁▁▁▁
owe 0 1 0.08 0.37 0 0 0 0 12 ▇▁▁▁▁
past 0 1 0.14 0.54 0 0 0 0 33 ▇▁▁▁▁
period 0 1 0.09 0.47 0 0 0 0 34 ▇▁▁▁▁
point 0 1 0.11 0.48 0 0 0 0 21 ▇▁▁▁▁
receive 0 1 0.19 0.57 0 0 0 0 20 ▇▁▁▁▁
received 0 1 0.74 1.45 0 0 0 1 49 ▇▁▁▁▁
refused 0 1 0.11 0.42 0 0 0 0 11 ▇▁▁▁▁
removed 0 1 0.08 0.39 0 0 0 0 13 ▇▁▁▁▁
report 0 1 0.25 0.76 0 0 0 0 34 ▇▁▁▁▁
request 0 1 0.25 0.88 0 0 0 0 48 ▇▁▁▁▁
requesting 0 1 0.07 0.31 0 0 0 0 9 ▇▁▁▁▁
send 0 1 0.19 0.60 0 0 0 0 17 ▇▁▁▁▁
spoke 0 1 0.21 0.68 0 0 0 0 38 ▇▁▁▁▁
stated 0 1 0.24 0.86 0 0 0 0 52 ▇▁▁▁▁
submitted 0 1 0.12 0.53 0 0 0 0 27 ▇▁▁▁▁
time 0 1 0.69 1.25 0 0 0 1 50 ▇▁▁▁▁
told 0 1 0.84 1.70 0 0 0 1 57 ▇▁▁▁▁
using 0 1 0.09 0.36 0 0 0 0 10 ▇▁▁▁▁
way 0 1 0.14 0.45 0 0 0 0 10 ▇▁▁▁▁
working 0 1 0.08 0.35 0 0 0 0 14 ▇▁▁▁▁
year 0 1 0.21 0.65 0 0 0 0 17 ▇▁▁▁▁
years 0 1 0.29 0.82 0 0 0 0 30 ▇▁▁▁▁
gave 0 1 0.09 0.37 0 0 0 0 10 ▇▁▁▁▁
information 0 1 0.47 1.24 0 0 0 0 60 ▇▁▁▁▁
like 0 1 0.21 0.60 0 0 0 0 23 ▇▁▁▁▁
loans 0 1 0.28 1.22 0 0 0 0 58 ▇▁▁▁▁
making 0 1 0.14 0.45 0 0 0 0 14 ▇▁▁▁▁
put 0 1 0.16 0.50 0 0 0 0 10 ▇▁▁▁▁
student 0 1 0.14 0.72 0 0 0 0 53 ▇▁▁▁▁
transferred 0 1 0.11 0.45 0 0 0 0 12 ▇▁▁▁▁
business 0 1 0.17 0.65 0 0 0 0 36 ▇▁▁▁▁
service 0 1 0.31 0.88 0 0 0 0 47 ▇▁▁▁▁
federal 0 1 0.10 0.65 0 0 0 0 52 ▇▁▁▁▁
instead 0 1 0.08 0.34 0 0 0 0 7 ▇▁▁▁▁
review 0 1 0.11 0.54 0 0 0 0 28 ▇▁▁▁▁
second 0 1 0.09 0.42 0 0 0 0 17 ▇▁▁▁▁
take 0 1 0.20 0.56 0 0 0 0 16 ▇▁▁▁▁
fees 0 1 0.24 0.98 0 0 0 0 67 ▇▁▁▁▁
full 0 1 0.18 0.62 0 0 0 0 25 ▇▁▁▁▁
keep 0 1 0.11 0.40 0 0 0 0 13 ▇▁▁▁▁
made 0 1 0.53 1.10 0 0 0 1 31 ▇▁▁▁▁
mail 0 1 0.15 0.54 0 0 0 0 16 ▇▁▁▁▁
online 0 1 0.18 0.64 0 0 0 0 25 ▇▁▁▁▁
paid 0 1 0.49 1.12 0 0 0 1 30 ▇▁▁▁▁
payment 0 1 1.38 2.84 0 0 0 2 81 ▇▁▁▁▁
purchase 0 1 0.15 0.63 0 0 0 0 20 ▇▁▁▁▁
score 0 1 0.11 0.49 0 0 0 0 20 ▇▁▁▁▁
system 0 1 0.10 0.48 0 0 0 0 14 ▇▁▁▁▁
thought 0 1 0.06 0.27 0 0 0 0 6 ▇▁▁▁▁
work 0 1 0.18 0.62 0 0 0 0 27 ▇▁▁▁▁
answer 0 1 0.08 0.36 0 0 0 0 14 ▇▁▁▁▁
applied 0 1 0.20 0.69 0 0 0 0 40 ▇▁▁▁▁
called 0 1 0.68 1.28 0 0 0 1 32 ▇▁▁▁▁
check 0 1 0.28 1.16 0 0 0 0 38 ▇▁▁▁▁
fee 0 1 0.23 0.95 0 0 0 0 38 ▇▁▁▁▁
previous 0 1 0.08 0.35 0 0 0 0 12 ▇▁▁▁▁
provided 0 1 0.17 0.62 0 0 0 0 31 ▇▁▁▁▁
representative 0 1 0.20 0.76 0 0 0 0 24 ▇▁▁▁▁
response 0 1 0.15 0.63 0 0 0 0 50 ▇▁▁▁▁
started 0 1 0.09 0.37 0 0 0 0 22 ▇▁▁▁▁
three 0 1 0.10 0.43 0 0 0 0 28 ▇▁▁▁▁
calls 0 1 0.15 0.53 0 0 0 0 24 ▇▁▁▁▁
company 0 1 0.57 1.29 0 0 0 1 50 ▇▁▁▁▁
days 0 1 0.38 0.93 0 0 0 0 30 ▇▁▁▁▁
end 0 1 0.12 0.46 0 0 0 0 15 ▇▁▁▁▁
hours 0 1 0.09 0.40 0 0 0 0 18 ▇▁▁▁▁
payments 0 1 0.81 1.80 0 0 0 1 45 ▇▁▁▁▁
phone 0 1 0.36 0.93 0 0 0 0 34 ▇▁▁▁▁
agreement 0 1 0.10 0.63 0 0 0 0 32 ▇▁▁▁▁
asking 0 1 0.09 0.36 0 0 0 0 10 ▇▁▁▁▁
loan 0 1 1.29 4.07 0 0 0 1 837 ▇▁▁▁▁
office 0 1 0.10 0.52 0 0 0 0 22 ▇▁▁▁▁
paperwork 0 1 0.09 0.51 0 0 0 0 17 ▇▁▁▁▁
paying 0 1 0.17 0.53 0 0 0 0 20 ▇▁▁▁▁
regarding 0 1 0.12 0.46 0 0 0 0 14 ▇▁▁▁▁
statements 0 1 0.11 0.54 0 0 0 0 27 ▇▁▁▁▁
understand 0 1 0.08 0.32 0 0 0 0 9 ▇▁▁▁▁
advised 0 1 0.12 0.68 0 0 0 0 37 ▇▁▁▁▁
balance 0 1 0.42 1.32 0 0 0 0 81 ▇▁▁▁▁
money 0 1 0.34 1.00 0 0 0 0 44 ▇▁▁▁▁
monthly 0 1 0.21 0.76 0 0 0 0 30 ▇▁▁▁▁
rep 0 1 0.09 0.52 0 0 0 0 24 ▇▁▁▁▁
reporting 0 1 0.11 0.49 0 0 0 0 23 ▇▁▁▁▁
returned 0 1 0.10 0.48 0 0 0 0 39 ▇▁▁▁▁
website 0 1 0.11 0.48 0 0 0 0 13 ▇▁▁▁▁
week 0 1 0.10 0.39 0 0 0 0 21 ▇▁▁▁▁
complete 0 1 0.07 0.36 0 0 0 0 13 ▇▁▁▁▁
department 0 1 0.19 0.76 0 0 0 0 32 ▇▁▁▁▁
documents 0 1 0.21 0.95 0 0 0 0 68 ▇▁▁▁▁
modification 0 1 0.20 1.18 0 0 0 0 42 ▇▁▁▁▁
month 0 1 0.37 0.92 0 0 0 0 21 ▇▁▁▁▁
number 0 1 0.30 0.95 0 0 0 0 26 ▇▁▁▁▁
process 0 1 0.22 0.72 0 0 0 0 21 ▇▁▁▁▁
stop 0 1 0.07 0.34 0 0 0 0 14 ▇▁▁▁▁
funds 0 1 0.14 0.71 0 0 0 0 45 ▇▁▁▁▁
give 0 1 0.13 0.43 0 0 0 0 10 ▇▁▁▁▁
matter 0 1 0.11 0.43 0 0 0 0 16 ▇▁▁▁▁
provide 0 1 0.16 0.58 0 0 0 0 22 ▇▁▁▁▁
say 0 1 0.09 0.35 0 0 0 0 10 ▇▁▁▁▁
thank 0 1 0.08 0.36 0 0 0 0 25 ▇▁▁▁▁
times 0 1 0.24 0.58 0 0 0 0 16 ▇▁▁▁▁
transaction 0 1 0.10 0.62 0 0 0 0 36 ▇▁▁▁▁
believe 0 1 0.12 0.43 0 0 0 0 11 ▇▁▁▁▁
currently 0 1 0.06 0.28 0 0 0 0 10 ▇▁▁▁▁
program 0 1 0.11 0.67 0 0 0 0 33 ▇▁▁▁▁
services 0 1 0.10 0.60 0 0 0 0 48 ▇▁▁▁▁
several 0 1 0.19 0.53 0 0 0 0 11 ▇▁▁▁▁
address 0 1 0.16 0.75 0 0 0 0 68 ▇▁▁▁▁
calling 0 1 0.07 0.31 0 0 0 0 11 ▇▁▁▁▁
claim 0 1 0.13 0.68 0 0 0 0 44 ▇▁▁▁▁
filed 0 1 0.12 0.47 0 0 0 0 21 ▇▁▁▁▁
informed 0 1 0.17 0.66 0 0 0 0 22 ▇▁▁▁▁
let 0 1 0.07 0.34 0 0 0 0 21 ▇▁▁▁▁
someone 0 1 0.13 0.45 0 0 0 0 10 ▇▁▁▁▁
status 0 1 0.08 0.60 0 0 0 0 130 ▇▁▁▁▁
used 0 1 0.14 0.49 0 0 0 0 13 ▇▁▁▁▁
continue 0 1 0.08 0.34 0 0 0 0 10 ▇▁▁▁▁
found 0 1 0.11 0.39 0 0 0 0 9 ▇▁▁▁▁
taken 0 1 0.09 0.37 0 0 0 0 14 ▇▁▁▁▁
based 0 1 0.09 0.40 0 0 0 0 12 ▇▁▁▁▁
cfpb 0 1 0.09 0.58 0 0 0 0 43 ▇▁▁▁▁
rate 0 1 0.19 0.94 0 0 0 0 62 ▇▁▁▁▁
refund 0 1 0.15 0.74 0 0 0 0 32 ▇▁▁▁▁
requested 0 1 0.20 0.63 0 0 0 0 40 ▇▁▁▁▁
resolve 0 1 0.09 0.36 0 0 0 0 14 ▇▁▁▁▁
total 0 1 0.11 0.55 0 0 0 0 65 ▇▁▁▁▁
tried 0 1 0.16 0.47 0 0 0 0 11 ▇▁▁▁▁
change 0 1 0.09 0.44 0 0 0 0 16 ▇▁▁▁▁
changed 0 1 0.07 0.34 0 0 0 0 12 ▇▁▁▁▁
day 0 1 0.24 0.69 0 0 0 0 15 ▇▁▁▁▁
notified 0 1 0.06 0.31 0 0 0 0 13 ▇▁▁▁▁
wrong 0 1 0.08 0.35 0 0 0 0 8 ▇▁▁▁▁
close 0 1 0.12 0.55 0 0 0 0 32 ▇▁▁▁▁
offered 0 1 0.07 0.33 0 0 0 0 17 ▇▁▁▁▁
opened 0 1 0.08 0.39 0 0 0 0 13 ▇▁▁▁▁
bill 0 1 0.15 0.66 0 0 0 0 38 ▇▁▁▁▁
car 0 1 0.20 1.23 0 0 0 0 80 ▇▁▁▁▁
hold 0 1 0.11 0.46 0 0 0 0 19 ▇▁▁▁▁
reason 0 1 0.11 0.41 0 0 0 0 11 ▇▁▁▁▁
said 0 1 0.50 1.36 0 0 0 0 46 ▇▁▁▁▁
agent 0 1 0.10 0.61 0 0 0 0 31 ▇▁▁▁▁
fraudulent 0 1 0.11 0.59 0 0 0 0 19 ▇▁▁▁▁
insurance 0 1 0.24 1.26 0 0 0 0 57 ▇▁▁▁▁
property 0 1 0.21 1.16 0 0 0 0 64 ▇▁▁▁▁
servicing 0 1 0.13 0.79 0 0 0 0 65 ▇▁▁▁▁
already 0 1 0.13 0.43 0 0 0 0 11 ▇▁▁▁▁
correct 0 1 0.11 0.44 0 0 0 0 15 ▇▁▁▁▁
current 0 1 0.14 0.54 0 0 0 0 20 ▇▁▁▁▁
dated 0 1 0.08 0.46 0 0 0 0 26 ▇▁▁▁▁
foreclosure 0 1 0.14 0.77 0 0 0 0 39 ▇▁▁▁▁
law 0 1 0.08 0.62 0 0 0 0 79 ▇▁▁▁▁
part 0 1 0.07 0.33 0 0 0 0 15 ▇▁▁▁▁
place 0 1 0.07 0.32 0 0 0 0 16 ▇▁▁▁▁
required 0 1 0.10 0.43 0 0 0 0 21 ▇▁▁▁▁
came 0 1 0.07 0.32 0 0 0 0 9 ▇▁▁▁▁
house 0 1 0.13 0.68 0 0 0 0 36 ▇▁▁▁▁
lender 0 1 0.11 0.66 0 0 0 0 29 ▇▁▁▁▁
application 0 1 0.17 0.85 0 0 0 0 32 ▇▁▁▁▁
apply 0 1 0.08 0.38 0 0 0 0 16 ▇▁▁▁▁
needed 0 1 0.15 0.54 0 0 0 0 18 ▇▁▁▁▁
plan 0 1 0.12 0.72 0 0 0 0 77 ▇▁▁▁▁
situation 0 1 0.12 0.45 0 0 0 0 21 ▇▁▁▁▁
customer 0 1 0.30 0.88 0 0 0 0 28 ▇▁▁▁▁
long 0 1 0.07 0.30 0 0 0 0 7 ▇▁▁▁▁
multiple 0 1 0.10 0.40 0 0 0 0 11 ▇▁▁▁▁
saying 0 1 0.11 0.39 0 0 0 0 9 ▇▁▁▁▁
terms 0 1 0.09 0.49 0 0 0 0 20 ▇▁▁▁▁
weeks 0 1 0.11 0.41 0 0 0 0 13 ▇▁▁▁▁
attached 0 1 0.11 0.51 0 0 0 0 21 ▇▁▁▁▁
available 0 1 0.08 0.39 0 0 0 0 23 ▇▁▁▁▁
denied 0 1 0.11 0.46 0 0 0 0 16 ▇▁▁▁▁
email 0 1 0.30 1.07 0 0 0 0 54 ▇▁▁▁▁
error 0 1 0.10 0.51 0 0 0 0 29 ▇▁▁▁▁
set 0 1 0.09 0.43 0 0 0 0 14 ▇▁▁▁▁
closing 0 1 0.14 0.88 0 0 0 0 51 ▇▁▁▁▁
sold 0 1 0.08 0.39 0 0 0 0 15 ▇▁▁▁▁
tell 0 1 0.09 0.39 0 0 0 0 19 ▇▁▁▁▁
escrow 0 1 0.22 1.20 0 0 0 0 55 ▇▁▁▁▁
took 0 1 0.14 0.45 0 0 0 0 11 ▇▁▁▁▁
attorney 0 1 0.10 0.62 0 0 0 0 41 ▇▁▁▁▁
ive 0 1 0.12 0.50 0 0 0 0 16 ▇▁▁▁▁
resolved 0 1 0.07 0.34 0 0 0 0 17 ▇▁▁▁▁
reported 0 1 0.11 0.44 0 0 0 0 22 ▇▁▁▁▁
state 0 1 0.09 0.51 0 0 0 0 49 ▇▁▁▁▁
states 0 1 0.08 0.52 0 0 0 0 59 ▇▁▁▁▁
file 0 1 0.13 0.56 0 0 0 0 35 ▇▁▁▁▁
though 0 1 0.10 0.38 0 0 0 0 10 ▇▁▁▁▁
two 0 1 0.23 0.66 0 0 0 0 32 ▇▁▁▁▁
different 0 1 0.12 0.43 0 0 0 0 14 ▇▁▁▁▁
longer 0 1 0.07 0.31 0 0 0 0 10 ▇▁▁▁▁
receiving 0 1 0.07 0.31 0 0 0 0 9 ▇▁▁▁▁
accounts 0 1 0.12 0.61 0 0 0 0 27 ▇▁▁▁▁
consumer 0 1 0.11 0.61 0 0 0 0 51 ▇▁▁▁▁
documentation 0 1 0.11 0.51 0 0 0 0 22 ▇▁▁▁▁
nothing 0 1 0.15 0.46 0 0 0 0 11 ▇▁▁▁▁
return 0 1 0.09 0.46 0 0 0 0 21 ▇▁▁▁▁
cards 0 1 0.12 0.62 0 0 0 0 35 ▇▁▁▁▁
line 0 1 0.10 0.47 0 0 0 0 25 ▇▁▁▁▁
notice 0 1 0.13 0.58 0 0 0 0 29 ▇▁▁▁▁
signed 0 1 0.11 0.47 0 0 0 0 45 ▇▁▁▁▁
explained 0 1 0.11 0.49 0 0 0 0 16 ▇▁▁▁▁
proof 0 1 0.11 0.51 0 0 0 0 28 ▇▁▁▁▁
approved 0 1 0.10 0.46 0 0 0 0 15 ▇▁▁▁▁
offer 0 1 0.12 0.67 0 0 0 0 37 ▇▁▁▁▁
supervisor 0 1 0.13 0.59 0 0 0 0 27 ▇▁▁▁▁
trying 0 1 0.15 0.46 0 0 0 0 17 ▇▁▁▁▁
agreed 0 1 0.08 0.38 0 0 0 0 12 ▇▁▁▁▁
copy 0 1 0.10 0.51 0 0 0 0 45 ▇▁▁▁▁
issues 0 1 0.09 0.40 0 0 0 0 19 ▇▁▁▁▁
history 0 1 0.08 0.43 0 0 0 0 31 ▇▁▁▁▁
show 0 1 0.07 0.33 0 0 0 0 19 ▇▁▁▁▁
transfer 0 1 0.11 0.66 0 0 0 0 38 ▇▁▁▁▁
look 0 1 0.06 0.29 0 0 0 0 10 ▇▁▁▁▁
purchased 0 1 0.08 0.37 0 0 0 0 13 ▇▁▁▁▁
care 0 1 0.08 0.42 0 0 0 0 18 ▇▁▁▁▁
owed 0 1 0.08 0.38 0 0 0 0 21 ▇▁▁▁▁
manager 0 1 0.11 0.53 0 0 0 0 16 ▇▁▁▁▁
speak 0 1 0.11 0.44 0 0 0 0 15 ▇▁▁▁▁
showing 0 1 0.08 0.37 0 0 0 0 14 ▇▁▁▁▁

Machine Learning Step #1: Data splitting

Split into training and testing datasets, stratified by Product:

set.seed(9876)

split <- initial_split(matrix_df, strata=Product, prop=2/3)

training <- training(split)
testing <- testing(split)

count(training) # 60649 records
##       n
## 1 60649
count(testing) # 30326 records
##       n
## 1 30326
# approximately 2:1, as we specified 
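
Stratifying by `Product` means the 2/3 sample is drawn within each class, so the class mix of the training set mirrors the full data. The base R sketch below illustrates the idea only; it is not how rsample implements `initial_split()`. The class sizes are made up, chosen to roughly match the class shares in this dataset (about 42%, 34%, 14% and 10%), and the short labels are hypothetical.

```r
set.seed(1)

# Made-up labels with roughly the same class mix as the complaints data
product <- factor(rep(
  c("Credit card", "Mortgage", "Student loan", "Vehicle loan"),
  times = c(420, 340, 140, 100)
))

# Row indices grouped by class
rows_by_class <- split(seq_along(product), product)

# Draw 2/3 of the rows inside every class separately
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), round(length(idx) * 2 / 3))]
}))

# Class counts in the sampled training rows
train_counts <- table(product[train_idx])
train_counts

# Class shares before and after the split stay close to each other
round(prop.table(table(product)), 3)
round(prop.table(train_counts), 3)
```

Without stratification, a small class such as vehicle loans could end up over- or under-represented in the training set purely by chance.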

Create 20 cross-validation folds from the training set:

set.seed(9876)

vfold_train <- rsample::vfold_cv(data = training, v = 20)
vfold_train
## #  20-fold cross-validation 
## # A tibble: 20 × 2
##    splits               id    
##    <list>               <chr> 
##  1 <split [57616/3033]> Fold01
##  2 <split [57616/3033]> Fold02
##  3 <split [57616/3033]> Fold03
##  4 <split [57616/3033]> Fold04
##  5 <split [57616/3033]> Fold05
##  6 <split [57616/3033]> Fold06
##  7 <split [57616/3033]> Fold07
##  8 <split [57616/3033]> Fold08
##  9 <split [57616/3033]> Fold09
## 10 <split [57617/3032]> Fold10
## 11 <split [57617/3032]> Fold11
## 12 <split [57617/3032]> Fold12
## 13 <split [57617/3032]> Fold13
## 14 <split [57617/3032]> Fold14
## 15 <split [57617/3032]> Fold15
## 16 <split [57617/3032]> Fold16
## 17 <split [57617/3032]> Fold17
## 18 <split [57617/3032]> Fold18
## 19 <split [57617/3032]> Fold19
## 20 <split [57617/3032]> Fold20
pull(vfold_train, splits)
## [[1]]
## <Analysis/Assess/Total>
## <57616/3033/60649>
## 
## [[2]]
## <Analysis/Assess/Total>
## <57616/3033/60649>
## 
## [[3]]
## <Analysis/Assess/Total>
## <57616/3033/60649>
## 
## [[4]]
## <Analysis/Assess/Total>
## <57616/3033/60649>
## 
## [[5]]
## <Analysis/Assess/Total>
## <57616/3033/60649>
## 
## [[6]]
## <Analysis/Assess/Total>
## <57616/3033/60649>
## 
## [[7]]
## <Analysis/Assess/Total>
## <57616/3033/60649>
## 
## [[8]]
## <Analysis/Assess/Total>
## <57616/3033/60649>
## 
## [[9]]
## <Analysis/Assess/Total>
## <57616/3033/60649>
## 
## [[10]]
## <Analysis/Assess/Total>
## <57617/3032/60649>
## 
## [[11]]
## <Analysis/Assess/Total>
## <57617/3032/60649>
## 
## [[12]]
## <Analysis/Assess/Total>
## <57617/3032/60649>
## 
## [[13]]
## <Analysis/Assess/Total>
## <57617/3032/60649>
## 
## [[14]]
## <Analysis/Assess/Total>
## <57617/3032/60649>
## 
## [[15]]
## <Analysis/Assess/Total>
## <57617/3032/60649>
## 
## [[16]]
## <Analysis/Assess/Total>
## <57617/3032/60649>
## 
## [[17]]
## <Analysis/Assess/Total>
## <57617/3032/60649>
## 
## [[18]]
## <Analysis/Assess/Total>
## <57617/3032/60649>
## 
## [[19]]
## <Analysis/Assess/Total>
## <57617/3032/60649>
## 
## [[20]]
## <Analysis/Assess/Total>
## <57617/3032/60649>
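
The fold sizes above follow from simple arithmetic: 60649 rows do not divide evenly by 20, so nine folds hold out 3033 rows and eleven hold out 3032. The base R sketch below reproduces those sizes with a generic random fold assignment; it is an illustration of v-fold partitioning under that assumption, not rsample's internal code.

```r
n <- 60649  # rows in the training set
v <- 20     # number of folds

set.seed(9876)

# Give every row a fold label 1..v, as evenly as possible, in random order
fold_id <- sample(rep(seq_len(v), length.out = n))

# Rows held out for assessment in each fold
assess_sizes <- as.vector(table(fold_id))

# Rows left for fitting the model in each fold
analysis_sizes <- n - assess_sizes

# How many folds have each assessment size
table(assess_sizes)
range(analysis_sizes)
```

Each row is held out exactly once, so every complaint in the training set contributes to the performance estimate.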

Creating a recipe, selecting a model

Specify variables with the recipe() function

Create a recipe:

recipe <- training %>% recipe(Product ~ .) 
# the . notation indicates that the rest of the variables are predictors

Create a model: decision tree

model <- decision_tree() %>%
  set_mode("classification") %>%
  set_engine("rpart")
model
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

Create a workflow:

workflow <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(model) 
workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

Model fit: without cross validation

fit <- fit(workflow, data=training)

# extract_fit_parsnip() replaces pull_workflow_fit(), deprecated in workflows 0.2.3
workflow_fit <- fit %>%
  extract_fit_parsnip()
workflow_fit
## parsnip model object
## 
## n= 60649 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 60649 35120 Credit card or prepaid card (0.420930271 0.340285907 0.137232271 0.101551551)  
##     2) mortgage< 0.5 45394 20220 Credit card or prepaid card (0.554566683 0.131999824 0.179803498 0.133629995)  
##       4) card>=0.5 18430   527 Credit card or prepaid card (0.971405317 0.005154639 0.011448725 0.011991319) *
##       5) card< 0.5 26964 19013 Student loan (0.269655837 0.218699006 0.294874648 0.216770509)  
##        10) student< 0.5 23189 15941 Credit card or prepaid card (0.312561991 0.253697874 0.182586571 0.251153564)  
##          20) car< 0.5 20234 13097 Credit card or prepaid card (0.352723139 0.288030048 0.206385292 0.152861520)  
##            40) loan< 0.5 12434  5672 Credit card or prepaid card (0.543831430 0.195110182 0.131413865 0.129644523)  
##              80) loans< 0.5 11725  4978 Credit card or prepaid card (0.575437100 0.197697228 0.091940299 0.134925373)  
##               160) escrow< 0.5 11319  4579 Credit card or prepaid card (0.595458963 0.169714639 0.095238095 0.139588303) *
##               161) escrow>=0.5 406     9 Mortgage (0.017241379 0.977832512 0.000000000 0.004926108) *
##              81) loans>=0.5 709   153 Student loan (0.021156559 0.152327221 0.784203103 0.042313117) *
##            41) loan>=0.5 7800  4398 Mortgage (0.048076923 0.436153846 0.325897436 0.189871795)  
##              82) home>=0.5 1534   172 Mortgage (0.012385919 0.887874837 0.046936115 0.052803129) *
##              83) home< 0.5 6266  3796 Student loan (0.056814555 0.325566550 0.394190871 0.223428024)  
##               166) loans< 0.5 4986  3082 Mortgage (0.070196550 0.381869234 0.285800241 0.262133975) *
##               167) loans>=0.5 1280   235 Student loan (0.004687500 0.106250000 0.816406250 0.072656250) *
##          21) car>=0.5 2955   224 Vehicle loan or lease (0.037563452 0.018612521 0.019627750 0.924196277) *
##        11) student>=0.5 3775    58 Student loan (0.006092715 0.003708609 0.984635762 0.005562914) *
##     3) mortgage>=0.5 15255   609 Mortgage (0.023271059 0.960078663 0.010553917 0.006096362) *

See which variables are important:

workflow_fit$fit$variable.importance
##     mortgage         card         loan      student         home       escrow 
## 1.156521e+04 7.228772e+03 2.931520e+03 2.792036e+03 2.213974e+03 2.145142e+03 
##          car       credit     property        loans modification  foreclosure 
## 2.067258e+03 1.989351e+03 1.680765e+03 1.608343e+03 1.402037e+03 1.358019e+03 
##      charges       charge        cards    servicing  application      program 
## 8.585883e+02 8.327012e+02 8.260333e+02 1.277481e+02 1.084374e+02 2.736204e+01 
##        years      federal      closing       making          got        close 
## 2.313052e+01 1.288626e+01 1.127550e+01 8.875347e+00 7.695376e+00 6.765302e+00 
##        house         sold     payments       issues         back         plan 
## 6.765302e+00 4.197478e+00 4.034963e+00 3.497898e+00 2.798318e+00 1.898763e+00 
##         form 
## 7.516392e-01

Terms such as mortgage, card, loan, student, home, escrow, car, and credit are among the most important predictors.
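Several terms in the importance list (credit, property, foreclosure) never appear as a split in the printed tree. That is expected with rpart, which also credits variables used as surrogate splits. A minimal sketch of how the scores can be read, using the built-in iris data as a stand-in for the complaint term counts (the object names here are illustrative and not part of the report):

```r
library(rpart)

# Toy stand-in for the complaint data: iris ships with R
toy_fit <- rpart(Species ~ ., data = iris, method = "class")

# variable.importance is a named numeric vector; sort it explicitly
imp <- sort(toy_fit$variable.importance, decreasing = TRUE)

# Rescale so the scores sum to 100, which makes them easier to compare
imp_pct <- round(100 * imp / sum(imp), 1)
head(imp_pct, 3)
```

Applied to the fitted complaint model, the same rescaling would show how strongly the importance is concentrated in the first few terms.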

Perform a prediction:

predicted <- predict(workflow_fit, new_data = training)
accuracy(training, truth = Product, estimate = predicted$.pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.841

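Accuracy on the training data is 0.841 (84.1%). Because the model is scored on the same data it was fit on, this estimate may be optimistic.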

Compare counts of Product vs. predicted Product:

count(training, Product)
##                       Product     n
## 1 Credit card or prepaid card 25529
## 2                    Mortgage 20638
## 3                Student loan  8323
## 4       Vehicle loan or lease  6159
count(predicted, .pred_class)
## # A tibble: 4 × 2
##   .pred_class                     n
##   <fct>                       <int>
## 1 Credit card or prepaid card 29749
## 2 Mortgage                    22181
## 3 Student loan                 5764
## 4 Vehicle loan or lease        2955
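The counts show that the tree predicts "Credit card or prepaid card" more often than it occurs and "Vehicle loan or lease" less often, but they do not show which classes are confused with each other. A confusion matrix answers that. Below is a minimal base-R sketch on made-up labels; in the report the two vectors would be training$Product and predicted$.pred_class (yardstick's conf_mat() offers a similar table).

```r
# Hypothetical toy labels, invented for illustration only
truth <- factor(c("Mortgage", "Mortgage", "Student loan", "Student loan",
                  "Vehicle loan or lease", "Vehicle loan or lease"))
pred  <- factor(c("Mortgage", "Mortgage", "Student loan", "Mortgage",
                  "Vehicle loan or lease", "Mortgage"),
                levels = levels(truth))

# Rows are predictions, columns are true classes
conf_mat <- table(predicted = pred, truth = truth)
conf_mat

# Per-class recall: correct predictions divided by the true class total
recall <- diag(conf_mat) / colSums(conf_mat)
recall
```

Off-diagonal cells in a column show where the complaints of that true class end up when they are misclassified.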

Compare by row:

predicted_vs_true <- bind_cols(training, predicted_Product=pull(predicted, .pred_class)) %>%
  select(predicted_Product, everything())
# show only the two label columns; the remaining columns are term counts
head(select(predicted_vs_true, predicted_Product, Product))
##              predicted_Product                     Product
## 6  Credit card or prepaid card Credit card or prepaid card
## 8  Credit card or prepaid card Credit card or prepaid card
## 12 Credit card or prepaid card Credit card or prepaid card
## 16 Credit card or prepaid card Credit card or prepaid card
## 17 Credit card or prepaid card Credit card or prepaid card
## 18 Credit card or prepaid card Credit card or prepaid card

Visualize: side-by-side barchart

ggplot(data = predicted_vs_true, aes(x = predicted_Product, fill = Product)) + 
  geom_bar(position = "dodge") +
  labs(x = "Predicted Product", y = "Frequency")

Model fit: with cross validation

resample_fit <- fit_resamples(workflow, vfold_train)
collect_metrics(resample_fit)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n  std_err .config             
##   <chr>    <chr>      <dbl> <int>    <dbl> <chr>               
## 1 accuracy multiclass 0.841    20 0.000866 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.930    20 0.000982 Preprocessor1_Model1

Cross-validated accuracy is also 0.841 (84.1%), matching the training accuracy, which suggests the default tree is not overfitting.
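To make the resampling step less of a black box, here is a hand-rolled sketch of v-fold cross validation with rpart. It uses the built-in iris data and 5 folds, both of which are assumptions chosen for illustration; the report's vfold_train object and its settings are not reproduced here.

```r
library(rpart)

set.seed(1234)
v <- 5  # number of folds; a hypothetical choice for this sketch

# Randomly assign each row of the toy data to one of v folds
fold_id <- sample(rep(seq_len(v), length.out = nrow(iris)))

fold_accuracy <- sapply(seq_len(v), function(k) {
  analysis   <- iris[fold_id != k, ]  # rows used to fit the tree
  assessment <- iris[fold_id == k, ]  # held-out rows used to score it
  fit  <- rpart(Species ~ ., data = analysis, method = "class")
  pred <- predict(fit, newdata = assessment, type = "class")
  mean(pred == assessment$Species)
})

# Mean and standard error across folds, analogous to the mean and std_err columns
c(mean = mean(fold_accuracy), std_err = sd(fold_accuracy) / sqrt(v))
```

Each row is held out exactly once, so every accuracy value comes from data the corresponding tree never saw.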

Tuning hyperparameters: can improve model performance

Two hyperparameters are tuned: the complexity parameter, cost_complexity(), and the maximum tree depth, tree_depth().

model_tune <- parsnip::decision_tree(cost_complexity=tune(), tree_depth=tune()) %>%
  parsnip::set_mode("classification") %>%
  parsnip::set_engine("rpart") 
model_tune
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = tune()
##   tree_depth = tune()
## 
## Computational engine: rpart

New workflow:

workflow_tune <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(model_tune)

Fit a model:

set.seed(1234)

# named tree_grid so the object does not shadow the tune_grid() function
tree_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 3)

resample_fit2 <- tune_grid(workflow_tune,
                           resamples = vfold_train,
                           grid = tree_grid,
                           metrics = metric_set(accuracy, roc_auc))
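A regular grid with levels = 3 crosses three values of each parameter, giving 9 candidate models, each of which is fit on all 20 resamples. The sketch below rebuilds an equivalent grid in base R. The ranges are assumptions (cost_complexity on a log10 scale from -10 to -1, tree_depth from 1 to 15); they are consistent with the values that appear in the tuning results but are not taken from the report's code.

```r
# Assumed ranges: cost_complexity on a log10 scale from -10 to -1,
# tree_depth from 1 to 15
cost_complexity_values <- 10^seq(-10, -1, length.out = 3)
tree_depth_values      <- round(seq(1, 15, length.out = 3))

# Cross every cost_complexity value with every tree_depth value
candidate_grid <- expand.grid(cost_complexity = cost_complexity_values,
                              tree_depth      = tree_depth_values)
candidate_grid
nrow(candidate_grid)
```

A larger levels value gives a finer grid, at the cost of levels squared model fits per resample.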

Assess performance:

See the cost_complexity and tree_depth values for the top performing models (those with the highest accuracy):

show_best(resample_fit2, metric="accuracy")
## # A tibble: 5 × 8
##   cost_complexity tree_depth .metric  .estimator  mean     n  std_err .config   
##             <dbl>      <int> <chr>    <chr>      <dbl> <int>    <dbl> <chr>     
## 1    0.0000000001         15 accuracy multiclass 0.879    20 0.00113  Preproces…
## 2    0.00000316           15 accuracy multiclass 0.879    20 0.00113  Preproces…
## 3    0.0000000001          8 accuracy multiclass 0.854    20 0.000805 Preproces…
## 4    0.00000316            8 accuracy multiclass 0.854    20 0.000805 Preproces…
## 5    0.0000000001          1 accuracy multiclass 0.657    20 0.00138  Preproces…
tuned_values <- select_best(resample_fit2, metric = "accuracy")
tuned_values
## # A tibble: 1 × 3
##   cost_complexity tree_depth .config             
##             <dbl>      <int> <chr>               
## 1    0.0000000001         15 Preprocessor1_Model7
workflow_tuned <- workflow_tune %>%
  finalize_workflow(tuned_values)

Fit the finalized workflow (the one used for tuning, with the best values plugged in) on the training portion of the split and evaluate it on the held-out portion:

overallfit <- last_fit(object = workflow_tuned, split = split)
collect_metrics(overallfit)
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy multiclass     0.878 Preprocessor1_Model1
## 2 roc_auc  hand_till      0.960 Preprocessor1_Model1

Accuracy on the held-out data is 0.878 (87.8%), so the expected out-of-sample error is about 1 - 0.878 = 0.122 (12.2%).

See predicted values:

predictions <- collect_predictions(overallfit)

Visualize: side-by-side barchart

ggplot(data = predictions, aes(x = .pred_class, fill = Product)) + 
  geom_bar(position = "dodge") +
  labs(x = "Predicted Product", y = "Frequency")

Alternative: stacked barchart

ggplot(data = predictions, aes(x = .pred_class, fill = Product)) + 
  geom_bar() +
  labs(x = "Predicted Product", y = "Frequency")

Working with testing data file

Look at the testing data:

head(test)
##   problem_id
## 1          1
## 2          2
## 3          3
## 4          4
## 5          5
## 6          6
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Consumer.complaint.narrative
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       I have multiple lateness/missed payments on my credit report due to ECMC not properly contacting me to inform me when payments were due. I never received any statements in regards to my loan. It was also a period in time when my loan was deferred so there should not be any late/ missed payments that is being reporting on my credit file.
## 2 I lost my job in XXXX due to the coronavirus and was not able to get unemployment I been filing and calling and appealing my claim but still no money I lost my house and staying with my parents I told this company that and they have a device on the car to stop it with no payment. Do they stop my car I told them that I finally found a temporary job its temp to hire and it starts this Monday XX/XX/XXXX I sent the email and ask please send a signal to my call so I can go sign my paperwork to start they told me Im behind on my note and that they would not be able to send the signal unless I pay XXXX dollars I cried on the phone begging and pleading eith them I told them I didnt have it that my family has helped me since XXXX I can give them XXXX dollars and with my fist check in XXXX I could pay XXXX every two weeks they told me no and the guy XXXX was very rude abs belittling and I ask to speak to the manager XXXX and he told me she was in a meeting and that the information came from her I cried I said please let me talk to her he told me she would call me after the meeting its now XXXX on Friday XX/XX/XXXX I guess shes still in the meeting this is the worst company to get a car from I will never do business with them again I told them to come get the car I have been through too much this year I lost my XXXX to the XXXX  and I lost my XXXX  to XXXX/ the XXXX  I lost my house my job and now my car
## 3                                                                                                                                                  I had contacted great lakes about my student loan. I told them I couldn't afford payments i couldn't find a job. So they put me in forberance but didnt quit explaining. It to me this was the end of XXXX. Well when I got on the pay as you go plan I got charged the interest than for some reason there was. Months were they put me in forberance tha. The same month on a payment plan and that went back and fourth like that for quite a few months. I turned in my paperwork at the end of XXXX but they didn't approve it till XXXX so I got charged interest. As well as XXXX XXXX to them about the borrowers defense they shut me down didnt even try to help. They know I work for a non profit organization and am gon na try for the pslf but they have never said that they don't dont do the qualifying for that as well as I've talked several times with the csr and wanted to make sure I was on the right payment plan to count for the pslf loan they would tell me one thing but than it would change I still dont know exactly if the pmt plan im on is correct i got nothing but resistance when trying the borrowers defense loan. The company is not very good and i dont feel that any advice they give is in to help me out. Just their pockets I learned that early on dealing with them
## 4                                                                               Wells Fargo Home Mortgage put me in a payment suspension program 3 months ago, even though I did not need it. I was not suffering a hardship from Covid 19. I not only haven't missed a payment during the 3-month forbearance, I've actually made 3 additional mortgage payments in that time period. On the Wells Fargo website, it is stated that I can end the payment suspension at any time. THAT IS A LIE. I have requested that the program end multiple times, and they still have not done it. In addition, because I am still in forbearance, my mortgage account statements are not accurate or transparent. The information on my statements appear to be based on a strange and questionable accounting program devised by Wells Fargo. What credible reason could Wells Fargo possibly have for their obscure accounting, as statements should be accurate at all times. I have questioned WF executives about their accounting and I've been told repeatedly that the statements will be adjusted at a later time. What kind of BS is that? I am requesting again, that I be taken out of the payment suspension program. I also expect Wells Fargo to provide me with an updated and ACCURATE mortgage statement. It doesn't say much for the credibility of Wells Fargo Home Mortgage that I have to address these issues through Consumer Financial Protection Bureau..
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                             We had a loan through Kia Motors Finance on our XXXX XXXX which we paid off XX/XX/XXXX. Since then, we have called customer service numerous times checking the status of our overpayment refund, but each time the story is the same : that a refund has been processed but not sent, and that there is no way to contact the responsible department other than by email, to which there has to-date never been a response. I have been reassured by various reps that they are looking into it, but nothing has changed. \n\nIt is now more than XXXX months later. If I were XXXX months late with a payment, Kia would at least continue charging interest, if not take legal action. I have contacted Kia 's complaint department in writing ( the only method which I was told will guarantee a response ), and I will be taking legal action if this matter is not addressed shortly. The pandemic is no excuse. Everyone else has carried on business as usual. Kia has a responsibility to as well.
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             I have received a billing statement from HOME DEPOT for a credit card which has NEVER been applied by me, this was received in USPS mail on XX/XX/2020. \n\nThe amount of credit limit on the statement shows up as {$8000.00} and the Purchases made fraudulently on that are of {$5300.00} I had called the HOME DEPOT credit card Customer Service on XXXX, and they were not able to resolve so they connected me to their fraud protection agency who were on the phone with me on XX/XX/2020 for 1 hour 16 minutes from XXXX CT Towards the end of my call with them, the fraud protection agency did confirm that the issue has been resolved and provided a reference # XXXX, but I repeatedly asked them to send me some communication by mail or email, but they were not willing to do so. \n\nI just need an assurance that this is resolved and that I am not liable to pay any amounts for purchases which have NOT been made.
##                    Company State ZIP.code Submitted.via
## 1         ECMC Group, Inc.    NY    112XX           Web
## 2 Michael Wayne Investment    VA    234XX           Web
## 3             Nelnet, Inc.    AR    727XX           Web
## 4    WELLS FARGO & COMPANY    PA    189XX           Web
## 5  HYUNDAI CAPITAL AMERICA    IN    471XX           Web
## 6           CITIBANK, N.A.    TX    750XX           Web

Clean the test narratives with the same steps that were applied to the training data:

test$Consumer.complaint.narrative <- gsub("XX/", "", test$Consumer.complaint.narrative)
test$Consumer.complaint.narrative <- gsub("X/", "", test$Consumer.complaint.narrative)
test$Consumer.complaint.narrative <- gsub("[XX]", "", test$Consumer.complaint.narrative)
test$Consumer.complaint.narrative <- gsub("[[:punct:]]", "", test$Consumer.complaint.narrative)
test$Consumer.complaint.narrative <- gsub("[0-9]", "", test$Consumer.complaint.narrative)
test$Consumer.complaint.narrative <- str_to_lower(test$Consumer.complaint.narrative)
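
Because the training and test narratives must go through identical substitutions, the steps above could be wrapped in one helper and applied to both data sets. This is only a minimal sketch: clean_narrative is a hypothetical name that does not appear elsewhere in this report, and base R's tolower() stands in for str_to_lower() so that the block has no package dependencies.

```r
# Hypothetical helper: applies the cleaning steps above, in the same order.
clean_narrative <- function(x) {
  x <- gsub("XX/", "", x)          # redacted date parts such as "XX/XX/"
  x <- gsub("X/", "", x)           # leftover single-X date parts
  x <- gsub("[XX]", "", x)         # any remaining capital X (redaction marks)
  x <- gsub("[[:punct:]]", "", x)  # punctuation
  x <- gsub("[0-9]", "", x)        # digits
  tolower(x)                       # lower-case everything
}

# Made-up example string, not taken from the data set.
cleaned_example <- clean_narrative("Hello XXXX 123!")
cleaned_example
```

With such a helper, the six assignments above would reduce to a single call on test$Consumer.complaint.narrative, and the training column could reuse exactly the same function.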

head(test)
##   problem_id
## 1          1
## 2          2
## 3          3
## 4          4
## 5          5
## 6          6
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Consumer.complaint.narrative
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              i have multiple latenessmissed payments on my credit report due to ecmc not properly contacting me to inform me when payments were due i never received any statements in regards to my loan it was also a period in time when my loan was deferred so there should not be any late missed payments that is being reporting on my credit file
## 2 i lost my job in  due to the coronavirus and was not able to get unemployment i been filing and calling and appealing my claim but still no money i lost my house and staying with my parents i told this company that and they have a device on the car to stop it with no payment do they stop my car i told them that i finally found a temporary job its temp to hire and it starts this monday  i sent the email and ask please send a signal to my call so i can go sign my paperwork to start they told me im behind on my note and that they would not be able to send the signal unless i pay  dollars i cried on the phone begging and pleading eith them i told them i didnt have it that my family has helped me since  i can give them  dollars and with my fist check in  i could pay  every two weeks they told me no and the guy  was very rude abs belittling and i ask to speak to the manager  and he told me she was in a meeting and that the information came from her i cried i said please let me talk to her he told me she would call me after the meeting its now  on friday  i guess shes still in the meeting this is the worst company to get a car from i will never do business with them again i told them to come get the car i have been through too much this year i lost my  to the   and i lost my   to  the   i lost my house my job and now my car
## 3                                                                                                        i had contacted great lakes about my student loan i told them i couldnt afford payments i couldnt find a job so they put me in forberance but didnt quit explaining it to me this was the end of  well when i got on the pay as you go plan i got charged the interest than for some reason there was months were they put me in forberance tha the same month on a payment plan and that went back and fourth like that for quite a few months i turned in my paperwork at the end of  but they didnt approve it till  so i got charged interest as well as   to them about the borrowers defense they shut me down didnt even try to help they know i work for a non profit organization and am gon na try for the pslf but they have never said that they dont dont do the qualifying for that as well as ive talked several times with the csr and wanted to make sure i was on the right payment plan to count for the pslf loan they would tell me one thing but than it would change i still dont know exactly if the pmt plan im on is correct i got nothing but resistance when trying the borrowers defense loan the company is not very good and i dont feel that any advice they give is in to help me out just their pockets i learned that early on dealing with them
## 4                                  wells fargo home mortgage put me in a payment suspension program  months ago even though i did not need it i was not suffering a hardship from covid  i not only havent missed a payment during the month forbearance ive actually made  additional mortgage payments in that time period on the wells fargo website it is stated that i can end the payment suspension at any time that is a lie i have requested that the program end multiple times and they still have not done it in addition because i am still in forbearance my mortgage account statements are not accurate or transparent the information on my statements appear to be based on a strange and questionable accounting program devised by wells fargo what credible reason could wells fargo possibly have for their obscure accounting as statements should be accurate at all times i have questioned wf executives about their accounting and ive been told repeatedly that the statements will be adjusted at a later time what kind of bs is that i am requesting again that i be taken out of the payment suspension program i also expect wells fargo to provide me with an updated and accurate mortgage statement it doesnt say much for the credibility of wells fargo home mortgage that i have to address these issues through consumer financial protection bureau
## 5                                                                                                                                                                                                                                                                                                                                                                                                                               we had a loan through kia motors finance on our   which we paid off  since then we have called customer service numerous times checking the status of our overpayment refund but each time the story is the same  that a refund has been processed but not sent and that there is no way to contact the responsible department other than by email to which there has todate never been a response i have been reassured by various reps that they are looking into it but nothing has changed \n\nit is now more than  months later if i were  months late with a payment kia would at least continue charging interest if not take legal action i have contacted kia s complaint department in writing  the only method which i was told will guarantee a response  and i will be taking legal action if this matter is not addressed shortly the pandemic is no excuse everyone else has carried on business as usual kia has a responsibility to as well
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               i have received a billing statement from home depot for a credit card which has never been applied by me this was received in usps mail on  \n\nthe amount of credit limit on the statement shows up as  and the purchases made fraudulently on that are of  i had called the home depot credit card customer service on  and they were not able to resolve so they connected me to their fraud protection agency who were on the phone with me on  for  hour  minutes from  ct towards the end of my call with them the fraud protection agency did confirm that the issue has been resolved and provided a reference   but i repeatedly asked them to send me some communication by mail or email but they were not willing to do so \n\ni just need an assurance that this is resolved and that i am not liable to pay any amounts for purchases which have not been made
##                    Company State ZIP.code Submitted.via
## 1         ECMC Group, Inc.    NY    112XX           Web
## 2 Michael Wayne Investment    VA    234XX           Web
## 3             Nelnet, Inc.    AR    727XX           Web
## 4    WELLS FARGO & COMPANY    PA    189XX           Web
## 5  HYUNDAI CAPITAL AMERICA    IN    471XX           Web
## 6           CITIBANK, N.A.    TX    750XX           Web

Create a document-term matrix from the cleaned test narratives:

test_matrix <- DocumentTermMatrix(test$Consumer.complaint.narrative)
inspect(test_matrix)
## <<DocumentTermMatrix (documents: 20, terms: 1234)>>
## Non-/sparse entries: 2350/22330
## Sparsity           : 90%
## Maximal term length: 16
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs and for forbearance have not that the they was with
##   11   4   5           0    2   4    5   6    3   0    0
##   12  16   2           0    1   7    4  39    1  12   10
##   13  17   6           0    3   6    4  19    8   5    3
##   17  29  11           7    4   4   25  33    0  29    5
##   2   16   0           0    3   2    5  14    5   3    4
##   3    5   6           0    1   1    6  14    9   3    2
##   4    4   2           2    5   5    8   8    1   1    1
##   5    2   0           0    3   3    3   6    1   1    1
##   7    9   4           0    0   5    1  10    3   4    2
##   9   33  31          33    9   7   38  81    6  30   21

Convert the test matrix to a data frame:

test_matrix_df <- test_matrix %>%
  as.matrix() %>%
  as.data.frame() 

Fit the tuned workflow and predict the test cases:

predict_model <- fit(workflow_tuned, matrix_df)
predict(predict_model, new_data = test_matrix_df)
## Error in `validate_column_names()`:
## ! The following required columns are missing: 'chase', 'happened', 'past', 'second', 'keep', 'system', 'thought', 'answer', 'previous', 'office', 'rep', 'services', 'cfpb', 'wrong', 'opened', 'insurance', 'property', 'foreclosure', 'part', 'lender', 'application', 'apply', 'long', 'sold', 'attorney', 'state', 'accounts', 'return', 'proof', 'supervisor', 'show', 'purchased', 'owed', 'showing'.

The prediction fails because some term columns that the model was trained on never occur in the 20 test narratives, so they are missing from the test data frame.
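
The mismatch can also be listed before calling predict() by comparing column names. The sketch below uses two small made-up data frames; train_terms and test_terms are illustrative stand-ins, not objects from this report. In the report itself the comparison would presumably be between the names of matrix_df and test_matrix_df, where the outcome column would show up as "missing" as well and should be ignored.

```r
# Toy stand-ins for the training and test term data frames (made-up values).
train_terms <- data.frame(loan = c(1, 0), chase = c(0, 2), lender = c(1, 1))
test_terms  <- data.frame(loan = c(2, 1), car = c(1, 0))

# Terms present in the training data that never occur in the test narratives.
missing_terms <- setdiff(names(train_terms), names(test_terms))
missing_terms

# Terms that occur only in the test narratives.
extra_terms <- setdiff(names(test_terms), names(train_terms))
extra_terms
```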

Add them manually as zero-filled columns, since a term that never appears in a narrative has a frequency of 0:

test_matrix_df$chase <- 0
test_matrix_df$cant <- 0
test_matrix_df$happened <- 0 
test_matrix_df$past <- 0  
test_matrix_df$something <- 0  
test_matrix_df$second <- 0  
test_matrix_df$keep <- 0  
test_matrix_df$system <- 0  
test_matrix_df$thought <- 0  
test_matrix_df$answer <- 0  
test_matrix_df$previous <- 0  
test_matrix_df$office <- 0  
test_matrix_df$rep <- 0  
test_matrix_df$assistance <- 0  
test_matrix_df$services <- 0  
test_matrix_df$either <- 0  
test_matrix_df$cfpb <- 0  
test_matrix_df$wrong <- 0  
test_matrix_df$opened <- 0  
test_matrix_df$insurance <- 0  
test_matrix_df$property <- 0  
test_matrix_df$foreclosure <- 0  
test_matrix_df$part <- 0  
test_matrix_df$lender <- 0  
test_matrix_df$application <- 0  
test_matrix_df$apply <- 0  
test_matrix_df$recently <- 0  
test_matrix_df$always <- 0  
test_matrix_df$long <- 0  
test_matrix_df$sold <- 0  
test_matrix_df$attorney <- 0  
test_matrix_df$state <- 0  
test_matrix_df$must <- 0  
test_matrix_df$accounts <- 0  
test_matrix_df$return <- 0  
test_matrix_df$proof <- 0  
test_matrix_df$supervisor <- 0  
test_matrix_df$wasnt <- 0  
test_matrix_df$show <- 0  
test_matrix_df$purchased <- 0  
test_matrix_df$person <- 0  
test_matrix_df$owed <- 0  
test_matrix_df$showing <- 0
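
Typing each column by hand is easy to get wrong when the vocabulary changes. As a hedged alternative, the same zero-filling could be done programmatically. The helper below is a sketch under assumptions: add_missing_columns is a hypothetical name, and the toy data frame and term list are made up for illustration.

```r
# Hypothetical helper: add a zero-filled column for every name in `required`
# that `df` lacks, leaving the existing columns untouched.
add_missing_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  for (col in missing_cols) {
    df[[col]] <- 0  # the scalar 0 is recycled to every row
  }
  df
}

# Toy example with made-up values.
toy_test <- data.frame(loan = c(2, 1), car = c(1, 0))
toy_test <- add_missing_columns(toy_test, c("loan", "chase", "lender"))
names(toy_test)
```

In the report, the required names would be the predictor columns of the training data frame, assuming those are the terms the fitted workflow expects.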

Predict again now that all required columns are present:

prediction <- predict(predict_model, new_data = test_matrix_df)

Merge the predictions back into the test dataset:

prediction <- bind_cols(test, predicted_Product = pull(prediction, .pred_class)) %>%
  select(predicted_Product, everything())

head(prediction)
##             predicted_Product problem_id
## 1       Vehicle loan or lease          1
## 2       Vehicle loan or lease          2
## 3                Student loan          3
## 4                    Mortgage          4
## 5                Student loan          5
## 6 Credit card or prepaid card          6
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Consumer.complaint.narrative
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              i have multiple latenessmissed payments on my credit report due to ecmc not properly contacting me to inform me when payments were due i never received any statements in regards to my loan it was also a period in time when my loan was deferred so there should not be any late missed payments that is being reporting on my credit file
## 2 i lost my job in  due to the coronavirus and was not able to get unemployment i been filing and calling and appealing my claim but still no money i lost my house and staying with my parents i told this company that and they have a device on the car to stop it with no payment do they stop my car i told them that i finally found a temporary job its temp to hire and it starts this monday  i sent the email and ask please send a signal to my call so i can go sign my paperwork to start they told me im behind on my note and that they would not be able to send the signal unless i pay  dollars i cried on the phone begging and pleading eith them i told them i didnt have it that my family has helped me since  i can give them  dollars and with my fist check in  i could pay  every two weeks they told me no and the guy  was very rude abs belittling and i ask to speak to the manager  and he told me she was in a meeting and that the information came from her i cried i said please let me talk to her he told me she would call me after the meeting its now  on friday  i guess shes still in the meeting this is the worst company to get a car from i will never do business with them again i told them to come get the car i have been through too much this year i lost my  to the   and i lost my   to  the   i lost my house my job and now my car
## 3                                                                                                        i had contacted great lakes about my student loan i told them i couldnt afford payments i couldnt find a job so they put me in forberance but didnt quit explaining it to me this was the end of  well when i got on the pay as you go plan i got charged the interest than for some reason there was months were they put me in forberance tha the same month on a payment plan and that went back and fourth like that for quite a few months i turned in my paperwork at the end of  but they didnt approve it till  so i got charged interest as well as   to them about the borrowers defense they shut me down didnt even try to help they know i work for a non profit organization and am gon na try for the pslf but they have never said that they dont dont do the qualifying for that as well as ive talked several times with the csr and wanted to make sure i was on the right payment plan to count for the pslf loan they would tell me one thing but than it would change i still dont know exactly if the pmt plan im on is correct i got nothing but resistance when trying the borrowers defense loan the company is not very good and i dont feel that any advice they give is in to help me out just their pockets i learned that early on dealing with them
## 4                                  wells fargo home mortgage put me in a payment suspension program  months ago even though i did not need it i was not suffering a hardship from covid  i not only havent missed a payment during the month forbearance ive actually made  additional mortgage payments in that time period on the wells fargo website it is stated that i can end the payment suspension at any time that is a lie i have requested that the program end multiple times and they still have not done it in addition because i am still in forbearance my mortgage account statements are not accurate or transparent the information on my statements appear to be based on a strange and questionable accounting program devised by wells fargo what credible reason could wells fargo possibly have for their obscure accounting as statements should be accurate at all times i have questioned wf executives about their accounting and ive been told repeatedly that the statements will be adjusted at a later time what kind of bs is that i am requesting again that i be taken out of the payment suspension program i also expect wells fargo to provide me with an updated and accurate mortgage statement it doesnt say much for the credibility of wells fargo home mortgage that i have to address these issues through consumer financial protection bureau
## 5                                                                                                                                                                                                                                                                                                                                                                                                                               we had a loan through kia motors finance on our   which we paid off  since then we have called customer service numerous times checking the status of our overpayment refund but each time the story is the same  that a refund has been processed but not sent and that there is no way to contact the responsible department other than by email to which there has todate never been a response i have been reassured by various reps that they are looking into it but nothing has changed \n\nit is now more than  months later if i were  months late with a payment kia would at least continue charging interest if not take legal action i have contacted kia s complaint department in writing  the only method which i was told will guarantee a response  and i will be taking legal action if this matter is not addressed shortly the pandemic is no excuse everyone else has carried on business as usual kia has a responsibility to as well
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               i have received a billing statement from home depot for a credit card which has never been applied by me this was received in usps mail on  \n\nthe amount of credit limit on the statement shows up as  and the purchases made fraudulently on that are of  i had called the home depot credit card customer service on  and they were not able to resolve so they connected me to their fraud protection agency who were on the phone with me on  for  hour  minutes from  ct towards the end of my call with them the fraud protection agency did confirm that the issue has been resolved and provided a reference   but i repeatedly asked them to send me some communication by mail or email but they were not willing to do so \n\ni just need an assurance that this is resolved and that i am not liable to pay any amounts for purchases which have not been made
##                    Company State ZIP.code Submitted.via
## 1         ECMC Group, Inc.    NY    112XX           Web
## 2 Michael Wayne Investment    VA    234XX           Web
## 3             Nelnet, Inc.    AR    727XX           Web
## 4    WELLS FARGO & COMPANY    PA    189XX           Web
## 5  HYUNDAI CAPITAL AMERICA    IN    471XX           Web
## 6           CITIBANK, N.A.    TX    750XX           Web
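
Before plotting, it can help to tabulate how many test cases fall into each predicted category. The sketch below uses made-up predictions (toy_predicted is an illustrative stand-in for prediction$predicted_Product) with the four Product levels from the training data.

```r
# Made-up predictions standing in for prediction$predicted_Product.
toy_predicted <- factor(
  c("Mortgage", "Student loan", "Mortgage", "Vehicle loan or lease"),
  levels = c("Credit card or prepaid card", "Mortgage",
             "Student loan", "Vehicle loan or lease")
)

# Frequency of each predicted category; levels with no predictions show 0.
category_counts <- table(toy_predicted)
category_counts
```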

Visualize the predictions as a scatter plot, with the predicted product on the x-axis and the problem ID on the y-axis:

ggplot(data = prediction, aes(x = predicted_Product, y = problem_id)) + 
  geom_point() + 
  labs(x = "Predicted Product Category", y = "Problem ID") +
  scale_y_continuous(breaks = seq(0, 20, by = 1)) +
  theme_bw()