We can use machine learning to predict labels on documents using a classification model. For both regression and classification questions, we develop a learner or model to describe the relationship between a target or outcome variable and our input features; what is different about a classification model is the nature of that outcome.
library(plyr)
library(knitr)
library(readr)
library(tidyverse)
library(tidymodels)
library(DT)
library(dplyr)
library(textrecipes)
set.seed(1234)
Let’s consider the data set of consumer complaints submitted to the US Consumer Financial Protection Bureau.
1.- Unzip the data set using the plyr package
# directory containing the zipped data
library(plyr)
my_dir <- "/Users/maria/OneDrive - City University of New York/Documents/R/DATA 607/Projects/Project # 4/data"
zip_file <- list.files(path = my_dir, pattern = "*.zip", full.names = TRUE)
ldply(.data = zip_file, .fun = unzip, exdir = my_dir)
## V1
## 1 /Users/maria/OneDrive - City University of New York/Documents/R/DATA 607/Projects/Project # 4/data/WA_Fn-UseC_-Telco-Customer-Churn.csv
## 2 /Users/maria/OneDrive - City University of New York/Documents/R/DATA 607/Projects/Project # 4/data/complaints.csv
2.- Load Data
library(readr)
complaints <- read_csv("data/complaints.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## `Date received` = col_date(format = ""),
## Product = col_character(),
## `Sub-product` = col_character(),
## Issue = col_character(),
## `Sub-issue` = col_character(),
## `Consumer complaint narrative` = col_character(),
## `Company public response` = col_character(),
## Company = col_character(),
## State = col_character(),
## `ZIP code` = col_character(),
## Tags = col_character(),
## `Consumer consent provided?` = col_character(),
## `Submitted via` = col_character(),
## `Date sent to company` = col_date(format = ""),
## `Company response to consumer` = col_character(),
## `Timely response?` = col_character(),
## `Consumer disputed?` = col_character(),
## `Complaint ID` = col_double()
## )
This data set contains a text field with the complaint narrative, along with information regarding what it was for, how and when it was filed, and the company's response.
library(dplyr)
glimpse(complaints)
## Rows: 2,067,107
## Columns: 18
## $ `Date received` <date> 2019-09-24, 2019-09-19, 2019-11-08, 20~
## $ Product <chr> "Debt collection", "Credit reporting, c~
## $ `Sub-product` <chr> "I do not know", "Credit reporting", "I~
## $ Issue <chr> "Attempts to collect debt not owed", "I~
## $ `Sub-issue` <chr> "Debt is not yours", "Information belon~
## $ `Consumer complaint narrative` <chr> "transworld systems inc. \nis trying to~
## $ `Company public response` <chr> NA, "Company has responded to the consu~
## $ Company <chr> "TRANSWORLD SYSTEMS INC", "Experian Inf~
## $ State <chr> "FL", "PA", "NC", "AZ", "TX", "TX", "MD~
## $ `ZIP code` <chr> "335XX", "15206", "275XX", "85254", "79~
## $ Tags <chr> NA, NA, NA, NA, NA, "Older American", N~
## $ `Consumer consent provided?` <chr> "Consent provided", "Consent not provid~
## $ `Submitted via` <chr> "Web", "Web", "Web", "Referral", "Web",~
## $ `Date sent to company` <date> 2019-09-24, 2019-09-20, 2019-11-08, 20~
## $ `Company response to consumer` <chr> "Closed with explanation", "Closed with~
## $ `Timely response?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes~
## $ `Consumer disputed?` <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A~
## $ `Complaint ID` <dbl> 3384392, 3379500, 3433198, 3255455, 319~
Classification is the task of predicting the class of a given input data point. Classification problems are common in machine learning, and they fall under supervised learning.
Classification comes in two types: binary classification, where there are two possible classes, and multiclass classification, where there are more than two.
I will build a classification model to predict what type of financial product a complaint refers to, i.e., a label or categorical variable.
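Before collapsing the labels, it helps to see how the complaints are distributed across products. A quick sketch, using the column name shown in glimpse() above:

complaints %>%
  count(Product, sort = TRUE)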
# let's take a look at the `Consumer complaint narrative` field
head(complaints$`Consumer complaint narrative`)
## [1] "transworld systems inc. \nis trying to collect a debt that is not mine, not owed and is inaccurate."
## [2] NA
## [3] "Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work."
## [4] NA
## [5] "I am a victim of identity theft. My personal information was compromised and fraudulent charges were included on my accounts without my consent or authorization. On on all accounts I sent this the reporting companies my letter which explained in detail what was fraudulent and needed removal from my file. I even attached the following documents to prove my case : 1. my Federal Trade Commission ID THEFT REPORT ID # XXXX 2.Proof of identity 3. Section 605B of the Fair Credit Reporting Act. But to my surprise, this companies completely ignored me and are still reporting the identified accounts with fraudulent charges on them. Please note that this letter is My final written proof of my intent to reach out to these companies before filing a lawsuit against them. Attached here are the documents I sent them in my earlier dispute with them for reference.."
## [6] NA
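Several of the narratives above are NA, apparently because the text is only published when the consumer consents (see the `Consumer consent provided?` column). A minimal check of how many complaints actually carry a narrative:

complaints %>%
  summarise(
    total_complaints = n(),
    with_narrative   = sum(!is.na(`Consumer complaint narrative`))
  )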
library(stringr)

# dollar amounts in the narratives are redacted inside curly braces, e.g. {$...};
# extract those patterns to see what they look like
complaints$`Consumer complaint narrative` %>%
  str_extract_all("\\{\\$[0-9\\.]*\\}") %>%
  compact() %>%
  head()
## [[1]]
## character(0)
##
## [[2]]
## [1] NA
##
## [[3]]
## character(0)
##
## [[4]]
## [1] NA
##
## [[5]]
## character(0)
##
## [[6]]
## [1] NA
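The narratives also contain censored strings such as "XXXX" in place of dates, names, and account numbers. A rough, illustrative count of how common these redaction markers are:

complaints %>%
  filter(!is.na(`Consumer complaint narrative`)) %>%
  mutate(has_redaction = str_detect(`Consumer complaint narrative`, "XX+")) %>%
  summarise(share_with_redaction = mean(has_redaction))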
Let’s build a binary classification model to predict whether a submitted complaint is about “Credit reporting, credit repair services, or other personal consumer reports” or not.
This data set includes more possible predictors than the text alone, but for this first model we will use only the text variable `Consumer complaint narrative`.
Factor the outcome variable `Product` into two levels:
1- “Credit” 2- “Other”
library(tidymodels)
# create the two-level factor outcome
complaints2class <- complaints %>%
  mutate(Product = factor(if_else(
    Product == paste("Credit reporting, credit repair services,",
                     "or other personal consumer reports"),
    "Credit", "Other"
  )))
complaints_split <- initial_split(complaints2class, strata = Product)

complaints_train <- training(complaints_split)
complaints_test <- testing(complaints_split)
dim(complaints_train)
## [1] 1550331 18
dim(complaints_test)
## [1] 516776 18
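Because the split is stratified by Product, the training and testing sets should keep roughly the same class proportions; a quick sanity-check sketch:

complaints_train %>%
  count(Product) %>%
  mutate(prop = n / sum(n))

complaints_test %>%
  count(Product) %>%
  mutate(prop = n / sum(n))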
Recipes Package
The recipes package lets us create a specification of the preprocessing steps we want to perform.
complaints_rec <- recipe(Product ~ `Consumer complaint narrative`,
                         data = complaints_train)
Text Recipes
I will use the textrecipes package to handle the `Consumer complaint narrative` text variable.
library(textrecipes)
# tokenize the narrative, keep the 1,000 most frequent tokens, then weight by tf-idf
complaints_rec <- complaints_rec %>%
  step_tokenize(`Consumer complaint narrative`) %>%
  step_tokenfilter(`Consumer complaint narrative`, max_tokens = 1e3) %>%
  step_tfidf(`Consumer complaint narrative`)
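To see what these steps actually produce, the recipe can be prepped and baked. The full data set is large, so this sketch preps on a small sample purely for inspection (the sample size of 1,000 is arbitrary):

complaints_rec %>%
  prep(training = slice_sample(complaints_train, n = 1000)) %>%
  bake(new_data = NULL) %>%
  glimpse()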
complaint_wf <- workflow() %>%
  add_recipe(complaints_rec)
library(discrim)
nb_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("naivebayes")
nb_spec
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes
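The discrim specification also exposes Laplace smoothing, which can help when a token never occurs in one of the classes. A hypothetical variant of the specification, not used below:

# optional variant with Laplace smoothing (illustrative only)
nb_spec_smoothed <- naive_Bayes(Laplace = 1) %>%
  set_mode("classification") %>%
  set_engine("naivebayes")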
I will use resampling methods to evaluate the model. Each of these splits contains information about how to create cross-validation folds from the training data.
set.seed(234)
complaints_folds <- vfold_cv(complaints_train)
complaints_folds
## # 10-fold cross-validation
## # A tibble: 10 x 2
## splits id
## <list> <chr>
## 1 <split [1395297/155034]> Fold01
## 2 <split [1395298/155033]> Fold02
## 3 <split [1395298/155033]> Fold03
## 4 <split [1395298/155033]> Fold04
## 5 <split [1395298/155033]> Fold05
## 6 <split [1395298/155033]> Fold06
## 7 <split [1395298/155033]> Fold07
## 8 <split [1395298/155033]> Fold08
## 9 <split [1395298/155033]> Fold09
## 10 <split [1395298/155033]> Fold10
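Each split object stores which rows belong to the analysis (training) portion and the assessment (validation) portion of that fold; for example, for the first fold:

first_fold <- complaints_folds$splits[[1]]
dim(analysis(first_fold))
dim(assessment(first_fold))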
Resampling estimates of performance
nb_wf <- workflow() %>%
  add_recipe(complaints_rec) %>%
  add_model(nb_spec)
nb_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: naive_Bayes()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes
library(naivebayes)
nb_rs <- fit_resamples(
  nb_wf,
  complaints_folds,
  control = control_resamples(save_pred = TRUE)
)

nb_rs_metrics <- collect_metrics(nb_rs)
nb_rs_predictions <- collect_predictions(nb_rs)

nb_rs_metrics
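Since predictions were saved with save_pred = TRUE, we can also summarize a confusion matrix averaged across the resamples to see which class the model confuses more often; a minimal sketch:

conf_mat_resampled(nb_rs, tidy = FALSE)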
nb_rs_predictions %>%
  group_by(id) %>%
  roc_curve(truth = Product, .pred_Credit) %>%
  autoplot() +
  labs(
    color = NULL,
    title = "ROC curve for US Consumer Finance Complaints",
    subtitle = "Each resample fold is shown in a different color"
  )
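Once the resampling results look acceptable, the last step would be to fit the workflow on the full training set and evaluate it once on the held-out test data; a sketch of that final step:

final_fitted <- last_fit(nb_wf, complaints_split)

collect_metrics(final_fitted)

collect_predictions(final_fitted) %>%
  conf_mat(truth = Product, estimate = .pred_class)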