Course: Tidyverse skills for data science Provider: John Hopkins University / Coursera

COURSE 3: PROJECT

This assignment uses data from Consumer Financial Protection Bureau (CFPB) in the United States (US). In this project, you will practice data exploration and data wrangling with the tidyverse R packages using this data.

The CFPB is an independent agency of the United States government that promotes transparency and protects consumers by providing information needed to make decisions when choosing financial institutions including banking institutions, lenders, mortgage services, credit unions, securities firms, foreclosure services, and debt collectors. One of the purposes of the agency is to receive and process complaints and questions about consumer financial products and services.

In this assignment, we will be using the Consumer Complaint Database. You can quickly view, sort and filter the data in your browser. In particular, we will be using data from October 15, 2017 to October 15, 2020, which we have made available below to download:

The codebook for the data includes 18 variables:

Field Name Description Data Type Notes
Date received The date the CFPF received the complaint date and time
Product The type of product the consumer identified in the complaint plain text This field is a categorical variable
Sub-product The type of sub-product the consumer identified in the complaint plain text This field is a categorical variable. Not all Products have Sub-products.
Issue The issue the consumer identified in the complaint plain text This field is a categorical variable. Possible values are dependent on Product.
Sub-issue The sub-issue the consumer identified in the complaint plain text This field is a categorical variable. Possible values are dependent on product and issue. Not all issues have corresponding sub-issues.
Consumer complaint narrative Consumer complaint narrative is the consumer-submitted description of “what happened” from the complaint. Consumers must opt-in to share their narrative. We will not publish the narrative unless the consumer consents, and consumers can opt-out at any time. The CFPB takes reasonable steps to scrub personal information from each complaint that could be used to identify the consumer. plain text Consumers’ descriptions of what happened are included if consumers consent to publishing the description and after we take steps to remove personal information.
Company public response The company’s optional, public-facing response to a consumer’s complaint. Companies can choose to select a response from a pre-set list of options that will be posted on the public database. For example, “Company believes complaint is the result of an isolated error.” plain text The company’s optional, public-facing response to a consumer’s complaint. Companies can choose to select a response from a pre-set list of options that will be posted on the public database. For example, “Company believes complaint is the result of an isolated error.”
Company The complaint is about this company plain text This field is a categorical variable.
State The state of the mailing address provided by the consumer plain text This field is a categorical variable
ZIP Code The mailing ZIP code provided by the consumer plain text Mailing ZIP code provided by the consumer. This field may: i) include the first five digits of a ZIP code; ii) include the first three digits of a ZIP code (if the consumer consented to publication of their complaint narrative); or iii) be blank (if ZIP codes have been submitted with non-numeric values, if there are less than 20,000 people in a given ZIP code, or if the complaint has an address outside of the United States).
Tags Data that supports easier searching and sorting of complaints submitted by or on behalf of consumers plain text
Consumer consent provided? Identifies whether the consumer opted in to publish their complaint narrative. We do not publish the narrative unless the consumer consents and consumers can opt-out at any time. plain text
Submitted via How the complaint was submitted to the CFPB plain text This field is a categorical variable
Date sent to company The date the CFPB sent the complaint to the company date and time
Company response to consumer This is how the company responsed. For example, “Closed with exaplanation.” plain text This field is a categorical variable
Timely response? Whether the company gave a timely response plain text yes/no
Consumer disputed? Whether the consumer disputed the company’s response plain text Yes, No or NA

To analyze the data in R, we need to estimate the memory required for loading the dataset, filter the data based on specific criteria, and then analyze the filtered data. Here’s how to proceed:

Step 1: Estimate Memory Requirement

First, let’s calculate the rough memory estimate for loading the dataset into R.

  1. Number of rows: 191,096
  2. Number of columns: 18

Assuming each cell in the dataset occupies approximately 8 bytes (a rough estimate), we can calculate the memory usage as follows:

\[ \text{Memory} = \text{Number of rows} \times \text{Number of columns} \times \text{Bytes per cell} \]

\[ \text{Memory} = 191,096 \times 18 \times 8 \text{ bytes} \]

\[ \text{Memory} \approx 27,533,824 \text{ bytes} \approx 27.53 \text{ MB} \]

Most modern computers can handle this size of data without any issues.

Getting the data

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate)
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.3.3
library(rvest)
## Warning: package 'rvest' was built under R version 4.3.3
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(dplyr)

url <-  "https://d3c33hcgiwev3.cloudfront.net/Rj3GCXMzSNW9xglzM1jVpg_f5aa2a757a614dfabb12d4b73705cd65_data_complaints.csv?Expires=1722556800&Signature=D106p9azWdX7hj~ZxG4dzPcah0aZc~qUdR-9SN4NHSObEE~0BdRwMCr843NUW-usdCeQTP61~vjE2FDK-yn6AXS888mxTTFt9qUga-~KMyu901Ccj1PH79HWv~b9F9k30ygwoiAr36~VjW3ijJbReSJiqAv7ehl7dL-JMO1hmnY_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A" 

complaints <- 
  read_csv(file = url) %>% 
  clean_names()
## Rows: 191096 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): Date received, Product, Sub-product, Issue, Sub-issue, Consumer co...
## dbl  (1): Complaint ID
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(complaints)
## # A tibble: 6 × 18
##   date_received product       sub_product issue sub_issue consumer_complaint_n…¹
##   <chr>         <chr>         <chr>       <chr> <chr>     <chr>                 
## 1 04/19/18      Credit card … General-pu… Clos… Can't cl…  <NA>                 
## 2 02/12/19      Credit card … General-pu… Prob… Credit c… "I initially in writi…
## 3 04/12/18      Mortgage      Convention… Inco… None      "My ex husband and my…
## 4 04/18/18      Student loan  Federal st… Stru… Problem … "I was a student at X…
## 5 04/28/18      Credit card … General-pu… Othe… Other pr… "It has come to my at…
## 6 11/20/17      Mortgage      Other type… Stru… None       <NA>                 
## # ℹ abbreviated name: ¹​consumer_complaint_narrative
## # ℹ 12 more variables: company_public_response <chr>, company <chr>,
## #   state <chr>, zip_code <chr>, tags <chr>, consumer_consent_provided <chr>,
## #   submitted_via <chr>, date_sent_to_company <chr>,
## #   company_response_to_consumer <chr>, timely_response <chr>,
## #   consumer_disputed <chr>, complaint_id <dbl>
complaints <- 
  complaints %>% 
  mutate(across(c(date_received,
                  date_sent_to_company),~ mdy(.x))) %>%
  mutate(across(c(product,
                  state),as_factor))

Question 1

Consider the column titled Date received and filter rows for all Student loan complaints received after January 1, 2018. Count the number of student loan complaints received by each US state and rank in a decreasing order with the state with the most complaints at the top and the states with the least complaints at the bottom.

How many student loan complaints did the US state with second most complaints have?

sl <- complaints %>%
  select(date_received,
         product,
         state) %>%
  filter(date_received > ymd("2018-01-01"),
         product == "Student loan") %>% 
  group_by(state) %>%
  count(product) %>% 
  arrange(desc(n))

print(sl)
## # A tibble: 61 × 3
## # Groups:   state [61]
##    state product          n
##    <fct> <fct>        <int>
##  1 CA    Student loan  2039
##  2 NY    Student loan  1374
##  3 FL    Student loan  1327
##  4 TX    Student loan  1293
##  5 PA    Student loan   943
##  6 OH    Student loan   771
##  7 GA    Student loan   749
##  8 IL    Student loan   748
##  9 None  Student loan   629
## 10 VA    Student loan   612
## # ℹ 51 more rows

Question 2

Complaints can be submitted in different ways to the CFPB. For example, someone can submit a complaint on the web, but they can also submit a complaint by phone.

Consider all complaints that were submitted by phone.

For this subset of complaints, what is the average amount of time between the complaint being received by CFPB (Date received) and the CFPB sending the complaint to the company (Date sent to company) ?

phone <- complaints %>% 
  filter(submitted_via == "Phone") %>%
  mutate(time_spam = date_sent_to_company - date_received) %>% 
  summarise(average_time = mean(time_spam) %>% 
              round(1))

print(phone)
## # A tibble: 1 × 1
##   average_time
##   <drtn>      
## 1 2.6 days

Question 3

Consider all Mortgage complaints.

How many times does the string “student” appear in the string of text in the Consumer complaint narrative column?

student <- complaints %>% 
  filter(product == "Mortgage") %>%
  mutate(count = str_count(consumer_complaint_narrative,
                           "student")) %>%
  drop_na(count) %>% 
  summarise(total = sum(count))

print(student)
## # A tibble: 1 × 1
##   total
##   <int>
## 1   250

Question 4

Consider only Mortgage complaints.

The Issue column contains a set of plain text categorical variables that the consumer identified in the complaint. For example, one type of issue is “Trouble during payment process” for Mortgage complaints.

For each Issue, calculate what is the average length of characters in the string text in the Consumer complaint narrative column.

Which Issue has the longest complaint length on average?

length <- complaints %>% 
  filter(product == "Mortgage") %>% 
  mutate(lenght = str_length(consumer_complaint_narrative)) %>%
  drop_na(lenght) %>% 
  group_by(issue) %>% 
  summarise(average_length = mean(lenght)) %>% 
  arrange(desc(average_length))

print(length)
## # A tibble: 10 × 2
##    issue                                                          average_length
##    <chr>                                                                   <dbl>
##  1 Closing on a mortgage                                                   2006.
##  2 Struggling to pay mortgage                                              1961.
##  3 Applying for a mortgage or refinancing an existing mortgage             1777.
##  4 Trouble during payment process                                          1719.
##  5 Unable to get your credit report or credit score                        1564.
##  6 Credit monitoring or identity theft protection services                 1237.
##  7 Problem with a credit reporting company's investigation into …          1232.
##  8 Incorrect information on your report                                    1150.
##  9 Improper use of your report                                              890 
## 10 Problem with fraud alerts or security freezes                            657.

Question 5

Consider only Credit card or prepaid card complaints. Consider the two columns Product and Consumer complaint narrative.

For the string text in the Consumer complaint narrative column, drop any rows with NA in the column and remove any strings such as “XX”, “XXX”, “XXXX” in the complaint.

Transform this data frame with two columns into a data frame with three columns (Product, id_number, and word) where the text in the Consumer complaint narrative column gets converted into a format where each row corresponds to one word of the complaint (word) while also keeping track of which complaint we are working with (id_number).

Remove any stop words.

What are the top three most frequently used words for this type of complaint?

frequent <- complaints %>% 
  filter(product == "Credit card or prepaid card") %>% 
  select(product, consumer_complaint_narrative) %>% 
  drop_na(consumer_complaint_narrative) %>% 
  filter(!str_detect(consumer_complaint_narrative, "XX") &
           !str_detect(consumer_complaint_narrative, "XXX") &
           !str_detect(consumer_complaint_narrative, "XXXX")) %>% 
  mutate(id_number = 1:length(product)) %>% 
  unnest_tokens(word,
                consumer_complaint_narrative) %>%
  anti_join(stop_words) %>% 
  count(word, sort = T)
## Joining with `by = join_by(word)`
print(frequent)
## # A tibble: 9,346 × 2
##    word         n
##    <chr>    <int>
##  1 credit    6971
##  2 card      6241
##  3 account   4935
##  4 payment   2078
##  5 bank      2014
##  6 balance   1594
##  7 payments  1463
##  8 late      1370
##  9 time      1343
## 10 company   1264
## # ℹ 9,336 more rows

Round to the nearest tenths digit.