Abstract

This is the final project students must complete in order to obtain a certificate for the Tidyverse Skills for Data Science in R Specialization offered through Coursera and Carrie Wright, PhD at Johns Hopkins University. “This Specialization is intended for data scientists with some familiarity with the R programming language who are seeking to do data science using the Tidyverse family of packages.” This project represents the final deliverable where “learners will build and organize a data science project from scratch… wrangle non-tidy data into tidy data, visualize data with ggplot2, and build machine learning prediction models.” For this project students were provided a data set with text of a consumer complaint, the associated “product” that the complaint was regarding, and several other text features. The four categories of the “product” are Student loan, Credit cards or prepaid cards, Mortgages, Vehicle loan or lease. The task is to wrangle the data to build a machine learning model that can accurately predict on unseen data.

The end deliverable is a random forest prediction model. On the initial training subset the prediction gave a 0.9844 accuracy estimate, this is reassuring since it does not equal 1. This suggests that the initial model did not over-learn and over-fit the data. Using the resampling technique of predicting on a smaller validation data set the model provided an accuracy estimate of 0.9120.

Data

The training data csv file is 90975, 4 in dimension, after na observations were omitted, using read_csv from the readr package. Out of the 5 features students could use, in this analysis, only the consumer complaint narrative as a predictor was considered. Using the skimr package:

Table 1: Skim of untidy training data.
skim_type	skim_variable	complete_rate	factor.ordered	factor.n_unique	factor.top_counts
factor	product	1	FALSE	4	cre: 38294, mor: 30957, stu: 12485, veh: 9239
factor	complaint	1	FALSE	90240	Thi: 125, I s: 45, Thi: 39, I w: 22
factor	state	1	FALSE	62	CA: 13552, FL: 8381, TX: 7114, NY: 5916
factor	company	1	FALSE	1497	CIT: 6520, CAP: 5808, Nav: 5330, JPM: 4978

Exploratory Data Analysis

From the website of Text Mining with R: “The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents”. The tf_idf for each of the outcome classifiers via the tidytext and ggplot2 package.

Figure 1

We notice a very low rate for credit_card, highest rate for student_loan, and similar rates for mortgage and vehicle. Intuitively this makes sense: credit_card would have the lowest document frequencies due to the number of unique companies and words associated with credit cards and prepaid cards.

Data Splitting

Using the capabilities of the rsample package our tidied training data is split for resampling purposes. The dimensions of the new training data are 68226, 3. While the dimension of the validation data set is 22742, 3. The validation set is used to check for any over-fit. Additionally 10 folds were created from the training subset to tune the hyperparameters inside the fitted model.

Table 2: Counts and Proportions
product	train_n	train_prop	sub_train_n	sub_train_prop	valid_n	valid_prop
credit_card	38292	0.4209	28731	0.4211	9561	0.4204
mortgage	30955	0.3403	23225	0.3404	7730	0.3399
student_loan	12482	0.1372	9355	0.1371	3127	0.1375
vehicle	9239	0.1016	6915	0.1014	2324	0.1022

Table 2 shows us that the proportions of the Product feature are consistent within each set. We are comfortable keeping the proportions as is. Some thought is given to the idea that the smaller proportioned groups could be under-represented and lead to bias. However, Figure 1 of our EDA we see that each set of complaint words appeared unique to each category. Therefore we are confident that keeping the proportions consistent should not affect the prediction model.

Recipe

Tidymodels describe recipes as being “designed to help you preprocess your data before training your model.” All steps were done on the complaint feature: tokenized into the tidy one-token-per-row format, two sets of stop_words were removed, the tokens were then limited to 100 unique values, and the term frequency_inverse document frequency (tf_idf) was calculated. The final step of converting to a tf_idf format creates a sparse matrix. Inside the random forest model we initialized the number of predictors equal to 3, the minimum number of data points in a node equal to 10, and the number of trees equal to 100.

Random Forest Model

From Wikipedia: “random forests or random decision forests is an ensemble learning method for classification… that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees.”

rand_forest: The engine is the default set to ranger and the mode is set to classification. It is a simple workflow: it adds the recipe to the model. The workflow is fit to the training subset, prediction accuracy is checked, then the prediction is run again on the validation subset after tuning the hyperparameters.

Initial Predictions and Accuracy

In Table 3 we have the accuracy measures for the random forest model.

Table 3: Initial Accuracies
set	.metric	.estimator	.estimate
train	accuracy	multiclass	0.9843901
valid	accuracy	multiclass	0.9119690

Tuning

As part of the tuning process of this model the training subset was split in to 10 samples, with strata set equal to product. The result of collect_metrics gave the best estimate using mtry = 7, min_n = 21, and trees kept at 100.

Final Predictions

The final model provided 85% accuracy on the unseen test data set.

Table 4: Final Class Predictions
.pred_class
student_loan
vehicle
student_loan
mortgage
student_loan
credit_card
credit_card
credit_card
student_loan
vehicle
mortgage
mortgage
credit_card
credit_card
mortgage
credit_card
mortgage
student_loan
vehicle
vehicle

Conclusion

In the initial analysis the company feature was ignored due to the huge number of unique names. The state feature was ignored because of the 62 unique identifiers. In an effort to improve accuracies in the model we could consider using these two additional features. The company feature would need to be tokenized. Using the term frequency for each company to the product might increase predictive capabilities. Since it is not recommended to use more than 52 unique values for one-hot encoding. The state feature could be limited to the most frequent values which would then limit the number of complaint narratives and companies in the initial data set.

Code

knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(warning = FALSE)
library(tidyverse)
library(tidytext)
library(tidyr)
library(tidymodels)
library(textrecipes)
library(parsnip)
library(skimr)
set.seed(42)
train <- read_csv("./data_complaints_train.csv") %>%
    select("Product","Consumer complaint narrative","State","Company") %>%
    setNames(c("product","complaint","state","company")) %>%
    mutate_if(is.character,as.factor) %>%
    na.omit()
dim_train <- dim(train)
test <- read_csv("./data_complaints_test.csv")%>%
    select("Consumer complaint narrative","State") %>%
    setNames(c("complaint","state")) %>%
    mutate_if(is.character,as.factor)
train$product <- recode(train$product,
                        `Credit card or prepaid card` = "credit_card",
                        `Mortgage`="mortgage",
                        `Student loan` = "student_loan",
                        `Vehicle loan or lease`="vehicle")
knitr::kable(skim(train),
             caption = "Table 1: Skim of untidy training data.")
product_train <- train %>%
    mutate(ID = row_number()) # add ID column
product_train <- product_train %>%
    select(product,complaint,ID) %>%
    group_by(product,ID) %>%
    unnest_tokens(word,complaint)  %>%
    anti_join(stop_words) %>%
    filter(!str_detect(word,"\\d")) %>%
    filter(str_detect(word,"[a-z]")) %>%
    filter(!str_detect(word,"X{1,}")) %>%
    filter(!str_detect(word,"x{1,}")) %>%
    summarize(complaint = str_c(word, collapse = " ")) %>%
    ungroup()
# complaint tf_idf
complaint_unigram <- product_train %>% 
    unnest_tokens(word,complaint) %>%
    count(product,word)
total_count <- complaint_unigram %>%
    group_by(product) %>%
    summarise(word_total = sum(n))
complaint_words <- left_join(complaint_unigram,total_count)
complaint_tf_idf <- complaint_words %>%
    bind_tf_idf(word,product,n)
company_tf_idf <- complaint_tf_idf %>%
    arrange(desc(tf_idf))
# tf_idf_plot
complaint_tf_idf %>%
    group_by(product) %>%
    slice_max(tf_idf,n=10) %>%
    ungroup() %>%
    ggplot(aes(tf_idf,fct_reorder(word,tf_idf),fill=product)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~product, ncol=2,scales="free") +
    labs(x="tf_idf",y=NULL,
         title="Term Frequency_Inverse Document Frequency")
# split for tuning later ----------
text_split <- initial_split(train, strata = "product")
df_train <- training(text_split)
df_valid <- testing(text_split)
dim_df_train <- dim(df_train)
dim_df_valid <- dim(df_valid)
full_prop <- product_train %>% 
    select(product) %>%
    count(product) %>%
    mutate(train_prop = round(n/sum(n),4)) %>%
    rename(train_n = n)
train_prop <- df_train %>%
    count(product) %>%
    mutate(sub_train_prop = round(n/sum(n),4)) %>%
    select(n,sub_train_prop) %>%
    rename(sub_train_n = n) 
valid_prop <- df_valid %>%
    count(product) %>%
    mutate(valid_prop = round(n/sum(n),4)) %>%
    select(n, valid_prop) %>%
    rename(valid_n=n)
props <- full_prop %>%
    bind_cols(train_prop,valid_prop)
knitr::kable(props, caption = "Table 2: Counts and Proportions")
my_stop_words <- c("XX","XXX","XXXX","xx","xxx","xxxx")
product_recipe <- recipe(product ~ complaint, data = df_train ) %>%
    step_tokenize(complaint) %>%
    step_stopwords(complaint) %>%
    step_stopwords(complaint, custom_stopword_source = my_stop_words) %>%
    step_tokenfilter(complaint, max_tokens = 50) %>%
    step_tfidf(complaint)
## random forest
rf_model <- rand_forest(mtry = 3, min_n = 10, trees = 100) %>%
    set_engine("ranger") %>%
    set_mode("classification")
## random_forest
rf_workflow <- workflow() %>%
    add_recipe(product_recipe) %>%
    add_model(rf_model)

fit_rf <- fit(rf_workflow,df_train)
## random_forest
rf_pred <- predict(fit_rf, df_train) %>% 
    bind_cols(predict(fit_rf, df_train, type = "prob")) %>% 
    bind_cols(df_train %>% select(product))  %>%
    rename(credit_card = ".pred_credit_card",
           mortgage = ".pred_mortgage",
           student_loan = ".pred_student_loan",
           vehicle = ".pred_vehicle")

## random_forest 
rf_acc <- accuracy(rf_pred, product, .pred_class)
multi_roc_auc <- roc_auc(rf_pred,
                         product,credit_card,
                         mortgage,student_loan,vehicle)

rf_valid_pred <- predict(fit_rf, new_data = df_valid) %>%
    bind_cols(predict(fit_rf,new_data = df_valid,type="prob")) %>%
    bind_cols(df_valid %>% select(product))
rf_valid_acc <- accuracy(rf_valid_pred, product, .pred_class)
acc_tibble <- tibble(union(rf_acc,rf_valid_acc))
subset <- tibble(set = c("train","valid"))
acc_tibble <- bind_cols(subset,acc_tibble)
knitr::kable(acc_tibble, caption = "Table 3: Initial Accuracies")
rf_model <- rand_forest(mtry = tune(), min_n = tune(), trees = 100) %>%
    set_engine("ranger") %>%
    set_mode("classification")
rf_workflow <- workflow() %>%
    add_recipe(product_recipe,
               blueprint =
                   hardhat::default_recipe_blueprint(allow_novel_levels =
                                                         TRUE)) %>%
    add_model(rf_model)
set.seed(234)
folds <- vfold_cv(df_train, strata = "product")
set.seed(543)
rf_resample <- rf_workflow %>%
    tune_grid(df_valid,
          grid = 5,
          resamples = folds,
          control = control_grid(save_pred = TRUE),
          metrics = metric_set(roc_auc))

# final model + test prediction ---------------------
rf_model <- rand_forest(mtry = 7, min_n = 21, trees = 100) %>% 
    set_engine("ranger") %>% 
    set_mode("classification")
rf_workflow <- workflow() %>%
    add_recipe(product_recipe,
               blueprint = 
                   hardhat::default_recipe_blueprint(allow_novel_levels =
                                                         TRUE)) %>%
    add_model(rf_model)
fit_rf <- fit(rf_workflow,df_train)
# final prediction -----------------
knitr::kable(predict(fit_rf, test), caption = "Table 4: Final Class Predictions")

tidyverse_peer_project

Nate Foulkes

3/22/2022