We can use machine learning to predict labels on documents using a classification model. For both regression and classification questions, we develop a learner or model to describe the relationship between a target or outcome variable and our input features; what is different about a classification model is the nature of that outcome.
library(plyr)
library(knitr)
library(readr)
library(tidyverse)
library(tidymodels)
library(DT)
library(dplyr)
library(textrecipes)
set.seed(1234)
Let’s consider the data set of consumer complaints submitted to the US Consumer Financial Protection Bureau.
1.- Unzip the data set using the plyr package
# directory containing the zipped data
library(plyr)
my_dir <- "/Users/maria/OneDrive - City University of New York/Documents/R/DATA 607/Projects/Project # 4/data"
zip_file <- list.files(path = my_dir, pattern = "*.zip", full.names = TRUE)
ldply(.data = zip_file, .fun = unzip, exdir = my_dir)
## V1
## 1 /Users/maria/OneDrive - City University of New York/Documents/R/DATA 607/Projects/Project # 4/data/WA_Fn-UseC_-Telco-Customer-Churn.csv
## 2 /Users/maria/OneDrive - City University of New York/Documents/R/DATA 607/Projects/Project # 4/data/complaints.csv
2.- Load Data
library(readr)
complaints <- read_csv("data/complaints.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## `Date received` = col_date(format = ""),
## Product = col_character(),
## `Sub-product` = col_character(),
## Issue = col_character(),
## `Sub-issue` = col_character(),
## `Consumer complaint narrative` = col_character(),
## `Company public response` = col_character(),
## Company = col_character(),
## State = col_character(),
## `ZIP code` = col_character(),
## Tags = col_character(),
## `Consumer consent provided?` = col_character(),
## `Submitted via` = col_character(),
## `Date sent to company` = col_date(format = ""),
## `Company response to consumer` = col_character(),
## `Timely response?` = col_character(),
## `Consumer disputed?` = col_character(),
## `Complaint ID` = col_double()
## )
This data set contains a text field with the complaint narrative, along with information regarding what it was for, how and when it was filed, and the company's response.
library(dplyr)
glimpse(complaints)
## Rows: 2,067,107
## Columns: 18
## $ `Date received` <date> 2019-09-24, 2019-09-19, 2019-11-08, 20~
## $ Product <chr> "Debt collection", "Credit reporting, c~
## $ `Sub-product` <chr> "I do not know", "Credit reporting", "I~
## $ Issue <chr> "Attempts to collect debt not owed", "I~
## $ `Sub-issue` <chr> "Debt is not yours", "Information belon~
## $ `Consumer complaint narrative` <chr> "transworld systems inc. \nis trying to~
## $ `Company public response` <chr> NA, "Company has responded to the consu~
## $ Company <chr> "TRANSWORLD SYSTEMS INC", "Experian Inf~
## $ State <chr> "FL", "PA", "NC", "AZ", "TX", "TX", "MD~
## $ `ZIP code` <chr> "335XX", "15206", "275XX", "85254", "79~
## $ Tags <chr> NA, NA, NA, NA, NA, "Older American", N~
## $ `Consumer consent provided?` <chr> "Consent provided", "Consent not provid~
## $ `Submitted via` <chr> "Web", "Web", "Web", "Referral", "Web",~
## $ `Date sent to company` <date> 2019-09-24, 2019-09-20, 2019-11-08, 20~
## $ `Company response to consumer` <chr> "Closed with explanation", "Closed with~
## $ `Timely response?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes~
## $ `Consumer disputed?` <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A~
## $ `Complaint ID` <dbl> 3384392, 3379500, 3433198, 3255455, 319~
Classification is the task of predicting the class of a given input data point. Classification problems are common in machine learning, and they fall under supervised learning.
Classification comes in two types: binary classification, where there are two possible classes, and multiclass classification, where there are more than two.
I will build a classification model to predict what type of financial product a complaint refers to, i.e., a label or categorical variable.
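Before collapsing the labels, it helps to see how the complaints are distributed across products. A quick sketch, using the column name shown in glimpse() above:

complaints %>%
  count(Product, sort = TRUE)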
# let's take a look at the `Consumer complaint narrative` field
head(complaints$`Consumer complaint narrative`)
## [1] "transworld systems inc. \nis trying to collect a debt that is not mine, not owed and is inaccurate."
## [2] NA
## [3] "Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work."
## [4] NA
## [5] "I am a victim of identity theft. My personal information was compromised and fraudulent charges were included on my accounts without my consent or authorization. On on all accounts I sent this the reporting companies my letter which explained in detail what was fraudulent and needed removal from my file. I even attached the following documents to prove my case : 1. my Federal Trade Commission ID THEFT REPORT ID # XXXX 2.Proof of identity 3. Section 605B of the Fair Credit Reporting Act. But to my surprise, this companies completely ignored me and are still reporting the identified accounts with fraudulent charges on them. Please note that this letter is My final written proof of my intent to reach out to these companies before filing a lawsuit against them. Attached here are the documents I sent them in my earlier dispute with them for reference.."
## [6] NA
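Several of the narratives above are NA, apparently because the text is only published when the consumer consents (see the `Consumer consent provided?` column). A minimal check of how many complaints actually carry a narrative:

complaints %>%
  summarise(
    total_complaints = n(),
    with_narrative   = sum(!is.na(`Consumer complaint narrative`))
  )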
library(stringr)

# dollar amounts in the narratives are redacted inside curly braces, e.g. {$...};
# extract those patterns to see what they look like
complaints$`Consumer complaint narrative` %>%
  str_extract_all("\\{\\$[0-9\\.]*\\}") %>%
  compact() %>%
  head()
## [[1]]
## character(0)
##
## [[2]]
## [1] NA
##
## [[3]]
## character(0)
##
## [[4]]
## [1] NA
##
## [[5]]
## character(0)
##
## [[6]]
## [1] NA
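The narratives also contain censored strings such as "XXXX" in place of dates, names, and account numbers. A rough, illustrative count of how common these redaction markers are:

complaints %>%
  filter(!is.na(`Consumer complaint narrative`)) %>%
  mutate(has_redaction = str_detect(`Consumer complaint narrative`, "XX+")) %>%
  summarise(share_with_redaction = mean(has_redaction))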
Let’s build a binary classification model to predict whether a submitted complaint is about “Credit reporting, credit repair services, or other personal consumer reports” or not.
This data set includes more possible predictors than the text alone, but for this first model we will use only the text variable `Consumer complaint narrative`.
Factor the outcome variable `Product` into two levels:
1- “Credit” 2- “Other”
library(tidymodels)
# create the two-level factor outcome
complaints2class <- complaints %>%
  mutate(Product = factor(if_else(
    Product == paste("Credit reporting, credit repair services,",
                     "or other personal consumer reports"),
    "Credit", "Other"
  )))
complaints_split <- initial_split(complaints2class, strata = Product)

complaints_train <- training(complaints_split)
complaints_test <- testing(complaints_split)
dim(complaints_train)
## [1] 1550331 18
dim(complaints_test)
## [1] 516776 18
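Because the split is stratified by Product, the training and testing sets should keep roughly the same class proportions; a quick sanity-check sketch:

complaints_train %>%
  count(Product) %>%
  mutate(prop = n / sum(n))

complaints_test %>%
  count(Product) %>%
  mutate(prop = n / sum(n))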
Recipes Package
The recipes package lets us create a specification of the preprocessing steps we want to perform.
complaints_rec <- recipe(Product ~ `Consumer complaint narrative`,
                         data = complaints_train)
Text Recipes
I will use the textrecipes package to handle the `Consumer complaint narrative` text variable.
library(textrecipes)
# tokenize the narrative, keep the 1,000 most frequent tokens, then weight by tf-idf
complaints_rec <- complaints_rec %>%
  step_tokenize(`Consumer complaint narrative`) %>%
  step_tokenfilter(`Consumer complaint narrative`, max_tokens = 1e3) %>%
  step_tfidf(`Consumer complaint narrative`)
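To see what these steps actually produce, the recipe can be prepped and baked. The full data set is large, so this sketch preps on a small sample purely for inspection (the sample size of 1,000 is arbitrary):

complaints_rec %>%
  prep(training = slice_sample(complaints_train, n = 1000)) %>%
  bake(new_data = NULL) %>%
  glimpse()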
complaint_wf <- workflow() %>%
  add_recipe(complaints_rec)
library(discrim)
nb_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("naivebayes")
nb_spec
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes
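The discrim specification also exposes Laplace smoothing, which can help when a token never occurs in one of the classes. A hypothetical variant of the specification, not used below:

# optional variant with Laplace smoothing (illustrative only)
nb_spec_smoothed <- naive_Bayes(Laplace = 1) %>%
  set_mode("classification") %>%
  set_engine("naivebayes")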
I will use resampling methods to evaluate the model. Each of these splits contains information about how to create cross-validation folds from the training data.
set.seed(234)
complaints_folds <- vfold_cv(complaints_train)
complaints_folds
## # 10-fold cross-validation
## # A tibble: 10 x 2
## splits id
## <list> <chr>
## 1 <split [1395297/155034]> Fold01
## 2 <split [1395298/155033]> Fold02
## 3 <split [1395298/155033]> Fold03
## 4 <split [1395298/155033]> Fold04
## 5 <split [1395298/155033]> Fold05
## 6 <split [1395298/155033]> Fold06
## 7 <split [1395298/155033]> Fold07
## 8 <split [1395298/155033]> Fold08
## 9 <split [1395298/155033]> Fold09
## 10 <split [1395298/155033]> Fold10
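Each split object stores which rows belong to the analysis (training) portion and the assessment (validation) portion of that fold; for example, for the first fold:

first_fold <- complaints_folds$splits[[1]]
dim(analysis(first_fold))
dim(assessment(first_fold))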
Resampling estimates of performance
nb_wf <- workflow() %>%
  add_recipe(complaints_rec) %>%
  add_model(nb_spec)
nb_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: naive_Bayes()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes
library(naivebayes)
nb_rs <- fit_resamples(
  nb_wf,
  complaints_folds,
  control = control_resamples(save_pred = TRUE)
)

nb_rs_metrics <- collect_metrics(nb_rs)
nb_rs_predictions <- collect_predictions(nb_rs)

nb_rs_metrics
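Since predictions were saved with save_pred = TRUE, we can also summarize a confusion matrix averaged across the resamples to see which class the model confuses more often; a minimal sketch:

conf_mat_resampled(nb_rs, tidy = FALSE)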
nb_rs_predictions %>%
  group_by(id) %>%
  roc_curve(truth = Product, .pred_Credit) %>%
  autoplot() +
  labs(
    color = NULL,
    title = "ROC curve for US Consumer Finance Complaints",
    subtitle = "Each resample fold is shown in a different color"
  )
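Once the resampling results look acceptable, the last step would be to fit the workflow on the full training set and evaluate it once on the held-out test data; a sketch of that final step:

final_fitted <- last_fit(nb_wf, complaints_split)

collect_metrics(final_fitted)

collect_predictions(final_fitted) %>%
  conf_mat(truth = Product, estimate = .pred_class)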