Consumer complaints data are released by the Consumer Financial Protection Bureau (CFPB) after they have been scrubbed of personally identifiable information (PII). FI used these data to demonstrate how text mining and machine learning could strengthen a fraud, waste and abuse detection system.
This hierarchical clustering helps explain the relationships among the terms used in 30,742 individual consumer complaints filed with the CFPB. Text mining techniques like this can help you understand high-level interactions and design your approach.
In the bottom portion of the tree, you can infer that a group of consumers is refusing to pay, or having difficulty paying, their debts, as indicated by the branch containing “now, can, will, never, year, get,” which connects to the words “debt” and “pay.” Much of this communication is likely occurring over the phone, as indicated by the word “call” at the base of the tree.
No organization has the resources to review every case for fraud, waste and abuse; however, machine learning algorithms can be used to leverage past cases or extrapolate from a small sample.
Using the same data, we sampled 10% of the complaints and trained a classification algorithm, a random forest, to predict the type of product each consumer is complaining about. We then used the trained model to predict product types for the remaining complaints in the database.
Our algorithm was 68% accurate. Given that there are 11 possible product types, a rough baseline for comparison is 9% (1/11, the accuracy of guessing at random), though the algorithm should be tuned further and/or compared with alternative methods.
Regardless, this methodology can be integrated into fraud, waste and abuse detection systems, enabling organizations to stretch limited resources and take full advantage of the information they already have.
For the technically inclined, the following code chunks demonstrate how to perform these tasks and replicate our results.
# Packages for data handling, text mining, modeling, and plotting
library(data.table)
library(magrittr)
library(tm)
library(caret)
library(ggplot2)
library(ggdendro)
library(ggthemes)

# Load the de-identified CFPB complaints export
cfpb <- fread("data/cfpb.csv", stringsAsFactors = FALSE)
# Strings the CFPB uses to redact personally identifiable information; removed from the corpus below
cfpb_sensoring <- c("xxx", "xxxx", "xxxxx", "xxxxxx", "xxxxxxx", "xxxxxxxx", "xxxxxxxxx", "xxxxxxxxxx")
# Keep only complaints that include a narrative
complaints <- cfpb[nchar(cfpb$`Consumer complaint narrative`) > 0, ]
# Build a corpus from the narratives and normalize the text
corp <- Corpus(VectorSource(complaints$`Consumer complaint narrative`))
corp <- corp %>%
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument) %>%
  tm_map(removeWords, cfpb_sensoring)
# Document-term matrix, dropping terms absent from more than 77% of complaints
dtm <- DocumentTermMatrix(corp)
dtm1 <- removeSparseTerms(dtm, 0.77)
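The train() call in the next chunk expects a modeling table, trn, that pairs each complaint's Product label with its term counts; that step is not shown above. Below is a minimal sketch of how such a table could be assembled from the 10% sample described earlier. The object names (dtm_dt, model_dt, tst), the seed, and the assumption that Product ends up in column 2 are ours, not a faithful reproduction of the original pipeline.

# Sketch (assumed layout): bind the term counts to the complaint records,
# then hold out 90% of the rows for later scoring
set.seed(2016)                                  # arbitrary seed for reproducibility
dtm_dt   <- as.data.table(as.matrix(dtm1))      # one column per retained term
model_dt <- cbind(complaints, dtm_dt)           # Product comes from the complaint records
idx <- sample(nrow(model_dt), size = round(0.10 * nrow(model_dt)))
trn <- model_dt[idx]                            # 10% training sample
tst <- model_dt[-idx]                           # remaining complaints to score later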
# Five-fold cross-validated random forest predicting Product from the retained term counts
rf <- train(Product ~ .,
            data = trn[, c(2, 17:122), with = FALSE],
            method = "rf",
            trControl = trainControl(method = "cv", number = 5),
            preProcess = c("center", "scale"))
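With the model fit, the held-out complaints can be scored to reproduce the accuracy figure quoted above. This chunk is a sketch that relies on the hypothetical tst table from the previous chunk, using the same columns as the training call.

# Predict product types for the complaints that were not part of the training sample
preds <- predict(rf, newdata = tst[, c(2, 17:122), with = FALSE])

# Overall accuracy; we observed roughly 68%, versus ~9% (1/11) for guessing at random
mean(preds == tst$Product)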
# Hierarchical clustering of the retained terms: Ward's method on Euclidean distances
d <- dist(t(dtm1), method = "euclidean")
fit <- hclust(d = d, method = "ward.D")

# Cut the dendrogram into k clusters (k = 4, as annotated in the plot)
k <- 4
groups <- cutree(fit, k = k)
# Prepare the dendrogram for ggplot, attaching each term's cluster membership
hcdata <- dendro_data(fit)
hcdata$labels$groups <- NULL
hcdata$labels <- merge(hcdata$labels,
                       data.frame(label = names(groups), groups = as.factor(groups)),
                       by = "label")

# Bounding boxes for the clusters: x-range per group, with ymax set near the height
# where the tree splits into k groups
rect <- aggregate(x ~ groups, label(hcdata), range)
rect <- data.frame(rect$groups, rect$x)
ymax <- mean(fit$height[length(fit$height) - ((k - 2):(k - 1))])
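The palette helper myPal() used in the plotting call below is not defined in these chunks. Any function that returns k colours will do; a minimal stand-in, assuming nothing about the original palette, is:

# Stand-in palette (assumption): colorRampPalette() builds a function that returns k colours
myPal <- colorRampPalette(c("#1b9e77", "#d95f02", "#7570b3", "#e7298a"))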
# Dendrogram with terms coloured by cluster and boxes drawn around each cluster
ggdendrogram(hcdata, rotate = TRUE) +
  theme_dendro() +
  ggtitle("Hierarchical Clustering") +
  geom_text(data = label(hcdata),
            aes(label = label, x = x + .1, y = -45, colour = groups, size = 16)) +
  geom_rect(data = rect,
            aes(xmin = X1 - .3, xmax = X2 + .3, ymin = -5, ymax = ymax,
                colour = rect.groups),
            fill = NA) +
  geom_abline(intercept = 545, slope = 0, linetype = 3) +
  geom_text(aes(x = 13, y = 560, label = "k = 4", angle = 270)) +
  labs(x = "", y = "") +
  scale_color_manual(values = myPal(k)) +
  scale_x_discrete(breaks = NULL) +
  guides(colour = FALSE, size = FALSE) +
  theme_pander()