Exploration Phase

The general idea is to use only 2/3 of the given training set for actual training and to hold out the remaining 1/3 as a validation set. As a potentially reliable predictor, I found that mapping the “Company” variable to a categorical variable “com” with four possible values (“A”, “B”, “C” and “D”) might be useful. The reason is that certain companies specialize in particular products: some offer student loans exclusively, and for others one product category at least dominates. Companies whose most frequent product is “Mortgage” get the category “A”; likewise, “Student loan” corresponds to “B”, “Credit card or prepaid card” to “C”, and “Vehicle loan or lease” to “D”.

library(tidyverse)
s1=filter(df,Product=="Mortgage") #df is the dataset with all observations (training,validation and test)
s1=group_by(s1,Company)
s1=summarise(s1,m1=n())
s2=filter(df,Product=="Student loan")
s2=group_by(s2,Company)
s2=summarise(s2,m2=n())
s3=filter(df,Product=="Credit card or prepaid card")
s3=group_by(s3,Company)
s3=summarise(s3,m3=n())
s4=filter(df,Product=="Vehicle loan or lease")
s4=group_by(s4,Company)
s4=summarise(s4,m4=n())

s12=full_join(s1,s2,by="Company")
s12$Company=as.character(s12$Company)
s34=full_join(s3,s4,by="Company")
s34$Company=as.character(s34$Company)
s=full_join(s12,s34,by="Company")
s[is.na(s)]<-0
s$m<-apply(s[,-1],1,max)

# Assign the category of the most frequent product; ties are resolved in the order A, B, C, D
s$com=ifelse(s$m==s$m1,"A",
             ifelse(s$m==s$m2,"B",
                    ifelse(s$m==s$m3,"C","D")))
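
The dominant-product rule can also be illustrated on a few made-up rows. The following is only a minimal sketch in base R: the data frame `complaints` and its company names are hypothetical toy data, and `table()` with `which.max()` replaces the joins above rather than reproducing them exactly.

```r
# Hypothetical toy data: one row per complaint.
complaints <- data.frame(
  Company = c("Bank1", "Bank1", "Bank1", "Lender2", "Lender2",
              "Auto3", "Auto3", "Auto3", "Card4"),
  Product = c("Mortgage", "Mortgage", "Credit card or prepaid card",
              "Student loan", "Student loan",
              "Vehicle loan or lease", "Vehicle loan or lease", "Mortgage",
              "Credit card or prepaid card"),
  stringsAsFactors = FALSE
)

# Fix the column order so that column 1..4 corresponds to A..D.
products <- c("Mortgage", "Student loan",
              "Credit card or prepaid card", "Vehicle loan or lease")

# Company x Product count table (companies as rows, sorted alphabetically).
counts <- table(complaints$Company, factor(complaints$Product, levels = products))

# which.max returns the first maximum, so ties fall to the earlier category.
com <- setNames(c("A", "B", "C", "D")[apply(counts, 1, which.max)],
                rownames(counts))
print(com)
```

As in the nested `ifelse` above, a company with equal counts in two categories receives the earlier letter.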

Another idea for extracting predictive value from a given variable is to examine the word distribution of the “Consumer complaint narrative”, split by product category. Due to space limitations I use only 1000 observations per category. I convert every word to lowercase and remove punctuation, numbers and so-called “stop words” (i.e. very frequent words like “the” and “and”). For each remaining word I determine its frequency and the percentage of the text it covers, and I also add the cumulative frequency.
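
The preprocessing steps just described could look roughly as follows. This is a minimal base-R sketch, not the code that produced the tables below: the two narratives are invented, the stop-word list is a tiny illustrative assumption, and the helper name `word_frequencies` is hypothetical.

```r
# Invented toy narratives; the real input would be the narrative column.
narratives <- c(
  "I paid my Mortgage on 01/02/2019, but the bank lost the payment.",
  "The mortgage company never sent the escrow statement!"
)

# Tiny illustrative stop-word list (assumption; a real analysis needs a fuller one).
stop_words <- c("i", "my", "on", "but", "the", "and", "a", "of", "to")

word_frequencies <- function(texts, stop_words) {
  cleaned <- tolower(texts)
  cleaned <- gsub("[[:punct:]]", " ", cleaned)  # replace punctuation by spaces
  cleaned <- gsub("[[:digit:]]", " ", cleaned)  # replace digits by spaces
  words <- unlist(strsplit(cleaned, "[[:space:]]+"))
  words <- words[nzchar(words) & !(words %in% stop_words)]
  freq <- sort(table(words), decreasing = TRUE)
  data.frame(
    word = names(freq),
    frequency = as.integer(freq),
    # share of the remaining (non-stop-word) tokens covered by each word
    percent = 100 * as.integer(freq) / length(words),
    cumulative = cumsum(as.integer(freq)),
    stringsAsFactors = FALSE
  )
}

result <- word_frequencies(narratives, stop_words)
print(head(result))
```

Here the percentage is taken relative to the tokens that remain after stop-word removal; taking it relative to all tokens would be an equally plausible reading of the description.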

Case Mortgage

## [1] 20586     7
##      xxxx  xxxxxxxx      loan  mortgage   payment  payments      home 
##     12047      2830      2259      2177      1773      1026      1005 
##      told   account  received   company      time       pay      will 
##       949       919       884       769       747       707       698 
##    called      sent      bank       get    escrow      back    letter 
##       662       654       645       623       609       587       584 
##       due     never insurance      paid       now  property       can 
##       567       566       563       561       553       552       549 
##      made      call 
##       536       523

Case Student Loan

##        xxxx        loan    xxxxxxxx    payments       loans     payment 
##        7443        2354        2062        1663        1651        1640 
##     navient     student        told    interest     account        time 
##        1130         989         841         830         813         781 
##         pay      credit    received      amount        made         due 
##         733         640         622         606         592         570 
## information      called       years         can         get        make 
##         566         565         563         541         503         491 
##         now        back     company       month       never        paid 
##         473         462         460         460         456         455

Case Credit Card

##     xxxx     card   credit  account xxxxxxxx  payment     bank   called 
##     7703     2541     2254     1836     1619      938      757      755 
##     told      one received     time  balance     late     made    never 
##      754      728      688      626      562      548      537      519 
##      due      pay interest     back     paid     call     said     will 
##      506      482      476      473      463      462      462      461 
##  charges  company     days    chase      fee customer 
##      442      441      435      433      432      415

Case Vehicle Loan

##      xxxx  xxxxxxxx   payment       car      loan    credit   vehicle 
##      9629      1647      1637      1626      1310      1187      1084 
##      told  payments   account      time       get    called   company 
##       980       932       862       744       725       708       696 
##      back       pay      paid      made      call      said       due 
##       670       641       637       615       603       579       554 
##  received      late      will     never    amount financial       one 
##       545       536       534       525       516       509       491 
##       can     asked 
##       458       452

Random Forest

As mentioned in the first section, the original training set is divided into a training and a validation portion (ratio 2:1; the validation set is called tst2 below). I use this single hold-out split instead of k-fold cross-validation because of the chosen machine learning method. Based on the word distributions per product shown above, I feed the counts of certain words in Consumer.complaint.narrative, together with the variable “com” (described in the exploration section), into a Random Forest model. Why Random Forest? It is a good choice when high predictive performance matters more than interpretability. A Random Forest also generalizes better than a single decision tree: each tree is grown on a bootstrap sample of the data, and each split searches for the best feature only among a random subset of the features, which makes the trees less correlated and reduces the variance of their combined vote.
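
To make the "random subset of features" concrete, here is a small base-R sketch of the two sources of randomness in a Random Forest: a bootstrap sample of the rows for each tree, and a random candidate set of features at a split. It only illustrates the idea and is not the internals of the randomForest package; the row count `n` is an arbitrary toy value. The formula `floor(sqrt(p))` is the package's default number of candidate features (`mtry`) for classification.

```r
set.seed(1)

# The 13 predictors used in the model of this section.
features <- c("credit", "mortgage", "lease", "car", "student", "charge",
              "ccard", "home", "vehicle", "navient", "company", "loan", "com")

# Default number of candidate features per split for classification.
mtry <- floor(sqrt(length(features)))

# Each tree is grown on a bootstrap sample: n rows drawn with replacement.
n <- 9  # toy number of training rows (assumption)
bootstrap_rows <- sample(n, n, replace = TRUE)

# At each split only a random subset of mtry features is considered.
candidates <- sample(features, mtry)

print(mtry)
print(bootstrap_rows)
print(candidates)
```

With 13 predictors, each split therefore chooses among only 3 randomly drawn features unless `mtry` is set explicitly.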

library(stringr)
library(dplyr)
library(randomForest)
library(rsample)
library(tidymodels)
set.seed(7893)
df0<-read.csv("trans_complaints_train.csv")
tst<-read.csv("trans_complaints_test.csv")
df1<-rsample::initial_split(data=df0,prop=2/3)
tr<-rsample::training(df1)
tst2<-rsample::testing(df1)
# Word-count features: column name -> pattern searched in the narrative.
# str_count() counts case-sensitive regex matches, including matches inside longer words
# (e.g. "car" also matches within "card").
patterns<-c(credit="credit",mortgage="mortgage",lease="lease",car="car",
            student="student",charge="charge",home="home",ccard="credit card",
            vehicle="vehicle",navient="navient",company="company",loan="loan")
for(nm in names(patterns)){
  tr[[nm]]<-str_count(tr$Consumer.complaint.narrative,patterns[[nm]])
  tst[[nm]]<-str_count(tst$Consumer.complaint.narrative,patterns[[nm]])
  tst2[[nm]]<-str_count(tst2$Consumer.complaint.narrative,patterns[[nm]])
}
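# randomForest needs a factor response for classification
# (read.csv returns character columns by default since R 4.0).
tr$Product<-as.factor(tr$Product)
# Note: "com" must already be present in tr, tst and tst2 (see the exploration phase).
modelRF<-randomForest(Product~credit+mortgage+lease+car+student+charge+ccard+home+vehicle+navient+company+loan+com,data=tr)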
predRF<-predict(modelRF,tst,type="class")   
print("Predictions on given test set (with 20 observations):")
print(predRF)

The results for the given test set are:

1. Student loan
2. Vehicle loan or lease
3. Student loan
4. Mortgage
5. Vehicle loan or lease
6. Credit card or prepaid card
7. Credit card or prepaid card
8. Credit card or prepaid card
9. Student loan
10. Vehicle loan or lease
11. Student loan
12. Mortgage
13. Credit card or prepaid card
14. Credit card or prepaid card
15. Mortgage
16. Credit card or prepaid card
17. Mortgage
18. Student loan
19. Mortgage
20. Mortgage

When I checked the consumer complaint narratives by hand, observations 13 and 19 turned out to be predicted incorrectly; both in fact belong to the category “Vehicle loan or lease”. Hence 18 of the 20 predictions are correct, an accuracy of 90 %.

In a somewhat tedious calculation in a separate Jupyter notebook, I found the following confusion matrix for the validation set:

|              | true MG | true SL | true CC | true VL |
|--------------|---------|---------|---------|---------|
| predicted MG | 10000   | 62      | 302     | 247     |
| predicted SL | 12      | 4058    | 29      | 4       |
| predicted CC | 293     | 25      | 12325   | 200     |
| predicted VL | 53      | 5       | 84      | 2626    |

MG = Mortgage, SL = Student loan, CC = Credit card or prepaid card, VL = Vehicle loan or lease

This results in an accuracy of (10000+4058+12325+2626)/30325 = 29009/30325, i.e. about 95.66 %, so the out-of-sample error is about 4.34 %.
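
The arithmetic can be checked in a few lines of R. The sketch below simply re-enters the validation counts reported above as a matrix (rows are predicted classes, columns are true classes) and additionally derives per-class recall and precision, which are not part of the original calculation.

```r
labels <- c("MG", "SL", "CC", "VL")

# Rows: predicted class, columns: true class (validation counts from above).
conf <- matrix(
  c(10000,   62,   302,  247,
       12, 4058,    29,    4,
      293,   25, 12325,  200,
       53,    5,    84, 2626),
  nrow = 4, byrow = TRUE,
  dimnames = list(predicted = labels, true = labels)
)

accuracy  <- sum(diag(conf)) / sum(conf)  # correct predictions / all predictions
recall    <- diag(conf) / colSums(conf)   # share of each true class that was found
precision <- diag(conf) / rowSums(conf)   # share of each predicted class that is correct

print(round(100 * accuracy, 2))           # 95.66
print(round(100 * (1 - accuracy), 2))     # 4.34
print(round(100 * recall, 1))
print(round(100 * precision, 1))
```

The per-class view suggests that “Vehicle loan or lease” is the weakest class, with a recall of roughly 85 %, which fits the two misclassified vehicle-loan observations in the small test set. Given vectors of predicted and true labels, `table(predicted, true)` would produce a matrix of this shape directly.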