The general idea is to use only 2/3 of the given training set for the actual training and to hold out the remaining 1/3 as a validation portion. As a potentially reliable predictor, I found that mapping the “Company” variable to a categorical variable “com” with four possible values (“A”, “B”, “C” and “D”) might be useful. This is because certain companies specialize in particular products: some offer student loans exclusively, while others at least receive most of their complaints in one category. Each company is assigned the category of the product for which it has the most complaints: companies mainly focused on mortgages get the category “A”, and correspondingly “Student loan” corresponds to “B”, “Credit card or prepaid card” to “C” and “Vehicle loan or lease” to “D”.
library(tidyverse)
s1=filter(df,Product=="Mortgage") #df is the dataset with all observations (training,validation and test)
s1=group_by(s1,Company)
s1=summarise(s1,m1=n())
s2=filter(df,Product=="Student loan")
s2=group_by(s2,Company)
s2=summarise(s2,m2=n())
s3=filter(df,Product=="Credit card or prepaid card")
s3=group_by(s3,Company)
s3=summarise(s3,m3=n())
s4=filter(df,Product=="Vehicle loan or lease")
s4=group_by(s4,Company)
s4=summarise(s4,m4=n())
s12=full_join(s1,s2,by="Company")
s12$Company=as.character(s12$Company)
s34=full_join(s3,s4,by="Company")
s34$Company=as.character(s34$Company)
s=full_join(s12,s34,by="Company")
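s[is.na(s)]<-0 # a company without complaints for a product gets a count of 0
s$m<-apply(s[,-1],1,max) # largest of the four counts per company
# Assign the category of the product with the largest count (ties go to the earlier category).
s$com=ifelse(s$m==s$m1,"A",
  ifelse(s$m==s$m2,"B",
    ifelse(s$m==s$m3,"C","D")))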
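The same assignment can be sketched more compactly with a contingency table. This is only a minimal illustration on made-up data: the data frame `toy`, its company names and the lookup table `com_lookup` are hypothetical and not part of the original analysis. Ties go to the first category in the order A, B, C, D, as in the nested `ifelse` above.

```r
# Toy data: hypothetical companies and products, for illustration only.
toy <- data.frame(
  Company = c("Bank X", "Bank X", "Bank X", "Lender Y", "Lender Y", "Auto Z"),
  Product = c("Mortgage", "Mortgage", "Credit card or prepaid card",
              "Student loan", "Student loan", "Vehicle loan or lease"),
  stringsAsFactors = FALSE
)

# The column order fixes the mapping: 1 -> "A", 2 -> "B", 3 -> "C", 4 -> "D".
products <- c("Mortgage", "Student loan",
              "Credit card or prepaid card", "Vehicle loan or lease")

# Complaints per company (rows) and product (columns).
counts <- table(toy$Company, factor(toy$Product, levels = products))

# which.max() returns the first maximum, so ties go to the earlier category.
dominant <- apply(counts, 1, which.max)
com_lookup <- data.frame(
  Company = rownames(counts),
  com = c("A", "B", "C", "D")[dominant],
  stringsAsFactors = FALSE
)

# Attach the category to every observation.
toy <- merge(toy, com_lookup, by = "Company")
print(com_lookup)
```

A join of this kind is one way to bring “com” back onto the individual observations, so that it is available as a column alongside the other predictors.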
Another idea for extracting predictive value from a given variable is to examine the word distribution of the “Consumer complaint narrative”, split by product category. Due to space limitations I use only 1000 observations per category. I convert every word to lowercase and remove punctuation, numbers and so-called “stop words” (i.e. very frequent words such as “the” and “and”). For each word I then determine its frequency and the percentage of the text it covers, and I also add the cumulative frequency.
## [1] 20586 7
## xxxx xxxxxxxx loan mortgage payment payments home
## 12047 2830 2259 2177 1773 1026 1005
## told account received company time pay will
## 949 919 884 769 747 707 698
## called sent bank get escrow back letter
## 662 654 645 623 609 587 584
## due never insurance paid now property can
## 567 566 563 561 553 552 549
## made call
## 536 523
## xxxx loan xxxxxxxx payments loans payment
## 7443 2354 2062 1663 1651 1640
## navient student told interest account time
## 1130 989 841 830 813 781
## pay credit received amount made due
## 733 640 622 606 592 570
## information called years can get make
## 566 565 563 541 503 491
## now back company month never paid
## 473 462 460 460 456 455
## xxxx card credit account xxxxxxxx payment bank called
## 7703 2541 2254 1836 1619 938 757 755
## told one received time balance late made never
## 754 728 688 626 562 548 537 519
## due pay interest back paid call said will
## 506 482 476 473 463 462 462 461
## charges company days chase fee customer
## 442 441 435 433 432 415
## xxxx xxxxxxxx payment car loan credit vehicle
## 9629 1647 1637 1626 1310 1187 1084
## told payments account time get called company
## 980 932 862 744 725 708 696
## back pay paid made call said due
## 670 641 637 615 603 579 554
## received late will never amount financial one
## 545 536 534 525 516 509 491
## can asked
## 458 452
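The preprocessing steps described above (lowercasing, removing punctuation, numbers and stop words, then computing frequency, percentage and cumulative frequency) can be sketched in base R as follows. This is not necessarily the code that produced the output above: the two narratives and the tiny stop-word list are made up for illustration, and taking the percentage relative to the words that remain after stop-word removal is an assumption.

```r
# Hypothetical narratives and a deliberately tiny stop-word list (illustration only).
narratives <- c("The bank sent 2 letters, and the loan payment was late.",
                "My loan payment was never received by the bank!")
stop_words <- c("the", "and", "was", "by", "my")

# Lowercase, then replace punctuation and digits by spaces.
clean <- tolower(narratives)
clean <- gsub("[[:punct:]]", " ", clean)
clean <- gsub("[[:digit:]]", " ", clean)

# Split into words and drop empty strings and stop words.
words <- unlist(strsplit(clean, "[[:space:]]+"))
words <- words[words != "" & !(words %in% stop_words)]

# Frequency table, sorted from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)
word_stats <- data.frame(
  word = names(freq),
  freq = as.integer(freq),
  pct = 100 * as.integer(freq) / sum(freq),  # assumption: share of the remaining words
  stringsAsFactors = FALSE
)
word_stats$cum_freq <- cumsum(word_stats$freq)
print(word_stats)
```

Running the same steps once per product category would give one frequency table per category, comparable to the four listings above.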
As mentioned in the first section, the idea is to divide the original training set into a training and a validation portion (ratio 2:1; the validation set is called tst2 below). I use this instead of k-fold cross-validation because of the chosen machine learning method. Based on the word distributions shown above, I feed the counts of certain words in Consumer.complaint.narrative, together with the variable “com” (described in the exploration section), into a Random Forest model. Why Random Forest? It is a good choice when high predictive performance matters more than interpretability. A Random Forest generally generalizes better than a single decision tree because it combines many trees, each grown on a bootstrap sample and each considering only a random subset of the features at every split.
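As a compact illustration of the keyword-count features, the counting can be wrapped in a small helper and applied to any data frame that has a narrative column. This is a sketch under assumptions: the helper `add_keyword_counts` and the data frame `toy` are hypothetical names introduced here, and `fixed()` is used so that each keyword is matched literally.

```r
library(stringr)

# Column name -> search string; the twelve keywords used as count features in this section.
keywords <- c(credit = "credit", mortgage = "mortgage", lease = "lease",
              car = "car", student = "student", charge = "charge",
              home = "home", ccard = "credit card", vehicle = "vehicle",
              navient = "navient", company = "company", loan = "loan")

# Hypothetical helper: adds one count column per keyword to the data frame d.
add_keyword_counts <- function(d, text_col = "Consumer.complaint.narrative") {
  for (nm in names(keywords)) {
    d[[nm]] <- str_count(d[[text_col]], fixed(keywords[[nm]]))
  }
  d
}

# Toy narratives, for illustration only.
toy <- data.frame(
  Consumer.complaint.narrative = c("my credit card was charged twice",
                                   "car loan and student loan"),
  stringsAsFactors = FALSE
)
toy <- add_keyword_counts(toy)
print(toy[, c("credit", "ccard", "car", "charge", "loan")])
```

Applying one helper to the training, validation and test sets keeps the three sets consistent. Note that `str_count` counts substrings, so “car” is also counted inside “card”, and “charge” inside “charged”.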
library(stringr)
library(dplyr)
library(randomForest)
library(rsample)
library(tidymodels)
set.seed(7893)
df0<-read.csv("trans_complaints_train.csv")
tst<-read.csv("trans_complaints_test.csv")
df1<-rsample::initial_split(data=df0,prop=2/3)
tr<-rsample::training(df1)
tst2<-rsample::testing(df1)
tr$credit<-str_count(tr$Consumer.complaint.narrative,"credit")
tst$credit=str_count(tst$Consumer.complaint.narrative,"credit")
tst2$credit=str_count(tst2$Consumer.complaint.narrative,"credit")
tr$mortgage=str_count(tr$Consumer.complaint.narrative,"mortgage")
tst$mortgage=str_count(tst$Consumer.complaint.narrative,"mortgage")
tst2$mortgage=str_count(tst2$Consumer.complaint.narrative,"mortgage")
tr$lease=str_count(tr$Consumer.complaint.narrative,"lease")
tst$lease=str_count(tst$Consumer.complaint.narrative,"lease")
tst2$lease=str_count(tst2$Consumer.complaint.narrative,"lease")
tr$car=str_count(tr$Consumer.complaint.narrative,"car")
tst$car=str_count(tst$Consumer.complaint.narrative,"car")
tst2$car=str_count(tst2$Consumer.complaint.narrative, "car")
tr$student=str_count(tr$Consumer.complaint.narrative,"student")
tst$student=str_count(tst$Consumer.complaint.narrative,"student")
tst2$student=str_count(tst2$Consumer.complaint.narrative,"student")
tr$charge=str_count(tr$Consumer.complaint.narrative,"charge")
tst$charge=str_count(tst$Consumer.complaint.narrative,"charge")
tst2$charge=str_count(tst2$Consumer.complaint.narrative,"charge")
tr$home=str_count(tr$Consumer.complaint.narrative,"home")
tst$home=str_count(tst$Consumer.complaint.narrative,"home")
tst2$home=str_count(tst2$Consumer.complaint.narrative,"home")
tr$ccard=str_count(tr$Consumer.complaint.narrative,"credit card")
tst$ccard=str_count(tst$Consumer.complaint.narrative,"credit card")
tst2$ccard=str_count(tst2$Consumer.complaint.narrative,"credit card")
tr$vehicle=str_count(tr$Consumer.complaint.narrative,"vehicle")
tst$vehicle=str_count(tst$Consumer.complaint.narrative,"vehicle")
tst2$vehicle=str_count(tst2$Consumer.complaint.narrative,"vehicle")
tr$navient=str_count(tr$Consumer.complaint.narrative,"navient")
tst$navient=str_count(tst$Consumer.complaint.narrative,"navient")
tst2$navient=str_count(tst2$Consumer.complaint.narrative,"navient")
tr$company=str_count(tr$Consumer.complaint.narrative,"company")
tst$company=str_count(tst$Consumer.complaint.narrative,"company")
tst2$company=str_count(tst2$Consumer.complaint.narrative,"company")
tr$loan=str_count(tr$Consumer.complaint.narrative,"loan")
tst$loan=str_count(tst$Consumer.complaint.narrative,"loan")
tst2$loan=str_count(tst2$Consumer.complaint.narrative,"loan")
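# randomForest() needs a factor response for classification (a no-op if Product is already a factor).
tr$Product<-as.factor(tr$Product)
# Assumption: "com" (see the exploration section) is already a column of tr, tst and tst2.
modelRF<-randomForest(Product~credit+mortgage+lease+car+student+charge+ccard+home+vehicle+navient+company+loan+com,data=tr)
predRF<-predict(modelRF,tst,type="class")
print("Predictions on given test set (with 20 observations):")
print(predRF)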
The results for the given test set are:
1. Student loan
2. Vehicle loan or lease
3. Student loan
4. Mortgage
5. Vehicle loan or lease
6. Credit card or prepaid card
7. Credit card or prepaid card
8. Credit card or prepaid card
9. Student loan
10. Vehicle loan or lease
11. Student loan
12. Mortgage
13. Credit card or prepaid card
14. Credit card or prepaid card
15. Mortgage
16. Credit card or prepaid card
17. Mortgage
18. Student loan
19. Mortgage
20. Mortgage
When I checked the consumer complaint narratives by hand, observations 13 and 19 turned out to be predicted incorrectly; both actually belong to the category “Vehicle loan or lease”. Hence the accuracy on the test set is 18/20 = 90%.
In a somewhat tedious calculation in a separate Jupyter notebook, I found the following confusion matrix for the validation set:
| predicted \ true | MG | SL | CC | VL |
|---|---|---|---|---|
| MG | 10000 | 62 | 302 | 247 |
| SL | 12 | 4058 | 29 | 4 |
| CC | 293 | 25 | 12325 | 200 |
| VL | 53 | 5 | 84 | 2626 |

MG = Mortgage, SL = Student loan, CC = Credit card or prepaid card, VL = Vehicle loan or lease
This results in an accuracy of (10000+4058+12325+2626)/30325 = 29009/30325, i.e. about 95.66 %, so the estimated out-of-sample error is about 4.34 %.
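The arithmetic can be checked with a few lines of R. The sketch below simply re-enters the counts of the confusion matrix above (rows are predicted classes, columns are true classes); the variable names `cm`, `accuracy`, `recall` and `precision` are chosen here for illustration.

```r
classes <- c("MG", "SL", "CC", "VL")

# Rows: predicted class, columns: true class (counts copied from the confusion matrix above).
cm <- matrix(c(10000,   62,   302,  247,
                  12, 4058,    29,    4,
                 293,   25, 12325,  200,
                  53,    5,    84, 2626),
             nrow = 4, byrow = TRUE,
             dimnames = list(predicted = classes, true = classes))

# Overall accuracy: correctly classified observations (diagonal) over all observations.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall (share of each true class that is predicted correctly)
# and precision (share of each predicted class that is correct).
recall <- diag(cm) / colSums(cm)
precision <- diag(cm) / rowSums(cm)

print(round(100 * accuracy, 2))
print(round(100 * recall, 2))
print(round(100 * precision, 2))
```

Assuming the fitted model `modelRF` and the validation set `tst2` from the modelling code are available in the same R session, a table of this form could presumably also be obtained directly with `table(predicted = predict(modelRF, tst2), true = tst2$Product)`, which would avoid the separate notebook.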