Wrangling Data in the Tidyverse Course Project

Preparation

Load data_complaints.csv file in R.

Please consider the following before loading the data into R: The data set has 191,096 rows and 18 columns. First calculate a rough estimate of how much memory the dataset will require in memory before reading into R. Make sure your computer has enough memory (most modern computer should be fine).

Question 1

Consider the column titled Date received and filter rows for all Student loan complaints received after January 1, 2018. Count the number of student loan complaints received by each US state and rank in a decreasing order with the state with the most complaints at the top and the states with the least complaints at the bottom.

How many student loan complaints did the US state with second most complaints have?

library(janitor)
library(tidyverse)
library(lubridate)

complaints <- read_csv("Data_Complaints.csv")
complaints <- janitor::clean_names(complaints)
complaints <- complaints %>% mutate(date_received = mdy(date_received),.after = date_received)
complaints <- complaints %>% mutate(date_sent_to_company = mdy(date_sent_to_company) )
head(complaints)

## # A tibble: 6 x 18
##   date_received product   sub_product   issue   sub_issue    consumer_complaint~
##   <date>        <chr>     <chr>         <chr>   <chr>        <chr>              
## 1 2018-04-19    Credit c~ General-purp~ Closin~ Can't close~  <NA>              
## 2 2019-02-12    Credit c~ General-purp~ Proble~ Credit card~ "I initially in wr~
## 3 2018-04-12    Mortgage  Conventional~ Incorr~ None         "My ex husband and~
## 4 2018-04-18    Student ~ Federal stud~ Strugg~ Problem low~ "I was a student a~
## 5 2018-04-28    Credit c~ General-purp~ Other ~ Other probl~ "It has come to my~
## 6 2017-11-20    Mortgage  Other type o~ Strugg~ None          <NA>              
## # ... with 12 more variables: company_public_response <chr>, company <chr>,
## #   state <chr>, zip_code <chr>, tags <chr>, consumer_consent_provided <chr>,
## #   submitted_via <chr>, date_sent_to_company <date>,
## #   company_response_to_consumer <chr>, timely_response <chr>,
## #   consumer_disputed <chr>, complaint_id <dbl>

Q1data <- complaints %>% filter(product=="Student loan")
Q1data <- Q1data %>% filter(date_received > as.Date("2018-01-01"))
Q1data %>% group_by(state) %>% tally() %>% arrange(desc(n)) %>% head()

## # A tibble: 6 x 2
##   state     n
##   <chr> <int>
## 1 CA     2039
## 2 NY     1374
## 3 FL     1327
## 4 TX     1293
## 5 PA      943
## 6 OH      771

Question 2

Complaints can be submitted in different ways to the CFPB. For example, someone can submit a complaint on the web, but they can also submit a complaint by phone.

Consider all complaints that were submitted by phone.

For this subset of complaints, what is the average amount of time between the complaint being received by CFPB (Date received) and the CFPB sending the complaint to the company (Date sent to company) ?

Round to the nearest tenths digit.

Q2data <- complaints 
Q2data <- Q2data %>% mutate(process_time = as.Date(date_sent_to_company) - as.Date(date_received))
Q2data %>% group_by(submitted_via) %>%summarize(mean_times = mean(process_time,na.rm=T)) %>% mutate(mean_times = round(mean_times,1))

## # A tibble: 6 x 2
##   submitted_via mean_times
##   <chr>         <drtn>    
## 1 Email         1.6 days  
## 2 Fax           1.7 days  
## 3 Phone         2.6 days  
## 4 Postal mail   1.4 days  
## 5 Referral      5.4 days  
## 6 Web           1.0 days

The average amount of time is 2.571 days, which rounds to 2.6 days using the nearest tenths digit.

Question 3

Consider all Mortgage complaints.

How many times does the string “student” appear in the string of text in the Consumer complaint narrative column?

Q3data <- complaints %>% filter(product   == "Mortgage"  )
str_count(Q3data$consumer_complaint_narrative,"student") %>% sum(na.rm=TRUE)

## [1] 250

Q3data <- complaints %>% filter(product   == "Credit card or prepaid card"  )
str_count(Q3data$consumer_complaint_narrative,"student") %>% sum(na.rm=TRUE)

## [1] 222

Q3data <- complaints %>% filter(product   == "Vehicle loan or lease"  )
str_count(Q3data$consumer_complaint_narrative,"student") %>% sum(na.rm=TRUE)

## [1] 49

Question 4

Consider only Vehicle loan or lease complaints.

The Issue column contains a set of plain text categorical variables that the consumer identified in the complaint. For example, one type of issue is “Trouble during payment process” for Mortgage complaints.

For each Issue, calculate what is the average length of characters in the string text in the Consumer complaint narrative column.

Which Issue has the longest complaint length on average?

Q4data <- complaints %>% filter(product   == "Vehicle loan or lease"  )
Q4data <- Q4data %>% mutate(charlength = nchar(consumer_complaint_narrative))
Q4data %>% group_by(issue) %>% summarise(charlength = mean(charlength,na.rm=T)) %>% arrange(desc(charlength))

## # A tibble: 11 x 2
##    issue                                                              charlength
##    <chr>                                                                   <dbl>
##  1 Getting a loan or lease                                                 1660.
##  2 Problems at the end of the loan or lease                                1498.
##  3 Managing the loan or lease                                              1402.
##  4 Struggling to pay your loan                                             1363.
##  5 Incorrect information on your report                                     938.
##  6 Problem with a credit reporting company's investigation into an e~       804.
##  7 Credit monitoring or identity theft protection services                  756 
##  8 Problem with fraud alerts or security freezes                            736.
##  9 Improper use of your report                                              733.
## 10 Unable to get your credit report or credit score                         569.
## 11 Problem with credit report or credit score                               NaN

Question 5

Consider only Mortgage complaints. Consider the two columns Product and Consumer complaint narrative.

For the string text in the Consumer complaint narrative column, drop any rows with NA in the column and remove any strings such as “XX”, “XXX”, “XXXX” in the complaint.

Transform this data frame with two columns into a data frame with three columns (Product, id_number, and word) where the text in the Consumer complaint narrative column gets converted into a format where each row corresponds to one word of the complaint (word) while also keeping track of which complaint we are working with (id_number).

Remove any stop words.

What are the top three most frequently used words for this type of complaint?

Q5data <- complaints %>% filter(product   == "Credit card or prepaid card"  )

str_count(Q5data$consumer_complaint_narrative,"account") %>% sum(na.rm=TRUE)

## [1] 74369

str_count(Q5data$consumer_complaint_narrative,"card") %>% sum(na.rm=TRUE)

## [1] 95536

str_count(Q5data$consumer_complaint_narrative,"credit") %>% sum(na.rm=TRUE)

## [1] 82609

str_count(Q5data$consumer_complaint_narrative,"bank") %>% sum(na.rm=TRUE)

## [1] 26385

str_count(Q5data$consumer_complaint_narrative,"payment") %>% sum(na.rm=TRUE)

## [1] 47020

## Mortage: Home: Loan and Payment
Q5data <- complaints %>% filter(product   == "Mortgage"  )
str_count(Q5data$consumer_complaint_narrative,"loan") %>% sum(na.rm=TRUE)

## [1] 64558

str_count(Q5data$consumer_complaint_narrative,"home") %>% sum(na.rm=TRUE)

## [1] 31643

str_count(Q5data$consumer_complaint_narrative,"payments") %>% sum(na.rm=TRUE)

## [1] 30691

str_count(Q5data$consumer_complaint_narrative,"mortgage") %>% sum(na.rm=TRUE)

## [1] 57267

str_count(Q5data$consumer_complaint_narrative,"payment") %>% sum(na.rm=TRUE)

## [1] 87226