Consumer complaints data are released by the Consumer Financial Protection Bureau (CFPB) after they have been scrubbed of personally identifiable information (PII). FI used these data to demonstrate how text mining and machine learning could strengthen a fraud, waste and abuse detection system.
This hierarchical clustering helps explain the relationships among the terms used in 30,742 individual consumer complaints filed with the CFPB. Text mining techniques like this can help you understand high-level interactions and design your approach.
In the bottom portion of the tree, you can infer that a group of consumers is refusing to pay, or having difficulty paying, their debts, as indicated by the branch containing “now, can, will, never, year, get,” which connects to the words “debt” and “pay.” Much of this communication is likely occurring over the phone, as indicated by the word “call” at the base of the tree.
No organization has the resources to review every case for fraud, waste and abuse; however, machine learning algorithms can be used to leverage past cases or extrapolate from a small sample.
Using the same data, we sampled 10% of the complaints and trained a classification algorithm, a random forest, to predict the type of product each consumer is complaining about. We then used the trained model to predict product types for the remaining complaints in the database.
Our algorithm was 68% accurate. Given that there are 11 possible product types, a rough baseline for comparison is 9% (1/11, the accuracy of guessing at random), though the algorithm should be tuned further and/or compared with alternative methods.
Regardless, this methodology can be integrated into fraud, waste and abuse detection systems, enabling organizations to stretch limited resources and take full advantage of the information they already have.
For the technically inclined, the following code chunks demonstrate how to perform these tasks and replicate our results.
# Packages for data handling, text mining, modeling, and plotting
library(data.table)
library(magrittr)
library(tm)
library(caret)
library(ggplot2)
library(ggdendro)
library(ggthemes)

# Load the de-identified CFPB complaints export
cfpb <- fread("data/cfpb.csv", stringsAsFactors = FALSE)
# Strings the CFPB uses to redact personally identifiable information; removed from the corpus below
cfpb_sensoring <- c("xxx", "xxxx", "xxxxx", "xxxxxx", "xxxxxxx", "xxxxxxxx", "xxxxxxxxx", "xxxxxxxxxx")
# Keep only complaints that include a narrative
complaints <- cfpb[nchar(cfpb$`Consumer complaint narrative`) > 0, ]
# Build a corpus from the narratives and normalize the text
corp <- Corpus(VectorSource(complaints$`Consumer complaint narrative`))
corp <- corp %>%
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument) %>%
  tm_map(removeWords, cfpb_sensoring)
# Document-term matrix, dropping terms absent from more than 77% of complaints
dtm <- DocumentTermMatrix(corp)
dtm1 <- removeSparseTerms(dtm, 0.77)
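The train() call in the next chunk expects a modeling table, trn, that pairs each complaint's Product label with its term counts; that step is not shown above. Below is a minimal sketch of how such a table could be assembled from the 10% sample described earlier. The object names (dtm_dt, model_dt, tst), the seed, and the assumption that Product ends up in column 2 are ours, not a faithful reproduction of the original pipeline.

# Sketch (assumed layout): bind the term counts to the complaint records,
# then hold out 90% of the rows for later scoring
set.seed(2016)                                  # arbitrary seed for reproducibility
dtm_dt   <- as.data.table(as.matrix(dtm1))      # one column per retained term
model_dt <- cbind(complaints, dtm_dt)           # Product comes from the complaint records
idx <- sample(nrow(model_dt), size = round(0.10 * nrow(model_dt)))
trn <- model_dt[idx]                            # 10% training sample
tst <- model_dt[-idx]                           # remaining complaints to score later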
# Five-fold cross-validated random forest predicting Product from the retained term counts
rf <- train(Product ~ .,
            data = trn[, c(2, 17:122), with = FALSE],
            method = "rf",
            trControl = trainControl(method = "cv", number = 5),
            preProcess = c("center", "scale"))
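With the model fit, the held-out complaints can be scored to reproduce the accuracy figure quoted above. This chunk is a sketch that relies on the hypothetical tst table from the previous chunk, using the same columns as the training call.

# Predict product types for the complaints that were not part of the training sample
preds <- predict(rf, newdata = tst[, c(2, 17:122), with = FALSE])

# Overall accuracy; we observed roughly 68%, versus ~9% (1/11) for guessing at random
mean(preds == tst$Product)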
# Hierarchical clustering of the retained terms: Ward's method on Euclidean distances
d <- dist(t(dtm1), method = "euclidean")
fit <- hclust(d = d, method = "ward.D")

# Cut the dendrogram into k clusters (k = 4, as annotated in the plot)
k <- 4
groups <- cutree(fit, k = k)
# Prepare the dendrogram for ggplot, attaching each term's cluster membership
hcdata <- dendro_data(fit)
hcdata$labels$groups <- NULL
hcdata$labels <- merge(hcdata$labels,
                       data.frame(label = names(groups), groups = as.factor(groups)),
                       by = "label")

# Bounding boxes for the clusters: x-range per group, with ymax set near the height
# where the tree splits into k groups
rect <- aggregate(x ~ groups, label(hcdata), range)
rect <- data.frame(rect$groups, rect$x)
ymax <- mean(fit$height[length(fit$height) - ((k - 2):(k - 1))])
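The palette helper myPal() used in the plotting call below is not defined in these chunks. Any function that returns k colours will do; a minimal stand-in, assuming nothing about the original palette, is:

# Stand-in palette (assumption): colorRampPalette() builds a function that returns k colours
myPal <- colorRampPalette(c("#1b9e77", "#d95f02", "#7570b3", "#e7298a"))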
# Dendrogram with terms coloured by cluster and boxes drawn around each cluster
ggdendrogram(hcdata, rotate = TRUE) +
  theme_dendro() +
  ggtitle("Hierarchical Clustering") +
  geom_text(data = label(hcdata),
            aes(label = label, x = x + .1, y = -45, colour = groups, size = 16)) +
  geom_rect(data = rect,
            aes(xmin = X1 - .3, xmax = X2 + .3, ymin = -5, ymax = ymax,
                colour = rect.groups),
            fill = NA) +
  geom_abline(intercept = 545, slope = 0, linetype = 3) +
  geom_text(aes(x = 13, y = 560, label = "k = 4", angle = 270)) +
  labs(x = "", y = "") +
  scale_color_manual(values = myPal(k)) +
  scale_x_discrete(breaks = NULL) +
  guides(colour = FALSE, size = FALSE) +
  theme_pander()