This is the final project students must complete in order to obtain a certificate for the Tidyverse Skills for Data Science in R Specialization offered through Coursera and Carrie Wright, PhD at Johns Hopkins University. “This Specialization is intended for data scientists with some familiarity with the R programming language who are seeking to do data science using the Tidyverse family of packages.” This project represents the final deliverable where “learners will build and organize a data science project from scratch… wrangle non-tidy data into tidy data, visualize data with ggplot2, and build machine learning prediction models.” For this project students were provided a data set with text of a consumer complaint, the associated “product” that the complaint was regarding, and several other text features. The four categories of the “product” are Student loan, Credit cards or prepaid cards, Mortgages, Vehicle loan or lease. The task is to wrangle the data to build a machine learning model that can accurately predict on unseen data.
The end deliverable is a random forest prediction model. On the initial training subset the prediction gave a 0.9844 accuracy estimate, this is reassuring since it does not equal 1. This suggests that the initial model did not over-learn and over-fit the data. Using the resampling technique of predicting on a smaller validation data set the model provided an accuracy estimate of 0.9120.
The training data csv file is 90975, 4 in dimension, after na observations were omitted, using read_csv from the readr package. Out of the 5 features students could use, in this analysis, only the consumer complaint narrative as a predictor was considered. Using the skimr package:
| skim_type | skim_variable | n_missing | complete_rate | factor.ordered | factor.n_unique | factor.top_counts |
|---|---|---|---|---|---|---|
| factor | product | 0 | 1 | FALSE | 4 | cre: 38294, mor: 30957, stu: 12485, veh: 9239 |
| factor | complaint | 0 | 1 | FALSE | 90240 | Thi: 125, I s: 45, Thi: 39, I w: 22 |
| factor | state | 0 | 1 | FALSE | 62 | CA: 13552, FL: 8381, TX: 7114, NY: 5916 |
| factor | company | 0 | 1 | FALSE | 1497 | CIT: 6520, CAP: 5808, Nav: 5330, JPM: 4978 |
From the website of Text Mining with R: “The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents”. The tf_idf for each of the outcome classifiers via the tidytext and ggplot2 package.
Figure 1
We notice a very low rate for credit_card, highest rate for student_loan, and similar rates for mortgage and vehicle. Intuitively this makes sense: credit_card would have the lowest document frequencies due to the number of unique companies and words associated with credit cards and prepaid cards.
Using the capabilities of the rsample package our tidied training data is split for resampling purposes. The dimensions of the new training data are 68226, 3. While the dimension of the validation data set is 22742, 3. The validation set is used to check for any over-fit. Additionally 10 folds were created from the training subset to tune the hyperparameters inside the fitted model.
| product | train_n | train_prop | sub_train_n | sub_train_prop | valid_n | valid_prop |
|---|---|---|---|---|---|---|
| credit_card | 38292 | 0.4209 | 28731 | 0.4211 | 9561 | 0.4204 |
| mortgage | 30955 | 0.3403 | 23225 | 0.3404 | 7730 | 0.3399 |
| student_loan | 12482 | 0.1372 | 9355 | 0.1371 | 3127 | 0.1375 |
| vehicle | 9239 | 0.1016 | 6915 | 0.1014 | 2324 | 0.1022 |
Table 2 shows us that the proportions of the Product feature are consistent within each set. We are comfortable keeping the proportions as is. Some thought is given to the idea that the smaller proportioned groups could be under-represented and lead to bias. However, Figure 1 of our EDA we see that each set of complaint words appeared unique to each category. Therefore we are confident that keeping the proportions consistent should not affect the prediction model.
Tidymodels describe recipes as being “designed to help you preprocess your data before training your model.” All steps were done on the complaint feature: tokenized into the tidy one-token-per-row format, two sets of stop_words were removed, the tokens were then limited to 100 unique values, and the term frequency_inverse document frequency (tf_idf) was calculated. The final step of converting to a tf_idf format creates a sparse matrix. Inside the random forest model we initialized the number of predictors equal to 3, the minimum number of data points in a node equal to 10, and the number of trees equal to 100.
From Wikipedia: “random forests or random decision forests is an ensemble learning method for classification… that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees.”
rand_forest: The engine is the default set to ranger and the mode is set to classification. It is a simple workflow: it adds the recipe to the model. The workflow is fit to the training subset, prediction accuracy is checked, then the prediction is run again on the validation subset after tuning the hyperparameters.
In Table 3 we have the accuracy measures for the random forest model.
| set | .metric | .estimator | .estimate |
|---|---|---|---|
| train | accuracy | multiclass | 0.9843901 |
| valid | accuracy | multiclass | 0.9119690 |
As part of the tuning process of this model the training subset was split in to 10 samples, with strata set equal to product. The result of collect_metrics gave the best estimate using mtry = 7, min_n = 21, and trees kept at 100.
The final model provided 85% accuracy on the unseen test data set.
| .pred_class |
|---|
| student_loan |
| vehicle |
| student_loan |
| mortgage |
| student_loan |
| credit_card |
| credit_card |
| credit_card |
| student_loan |
| vehicle |
| mortgage |
| mortgage |
| credit_card |
| credit_card |
| mortgage |
| credit_card |
| mortgage |
| student_loan |
| vehicle |
| vehicle |
In the initial analysis the company feature was ignored due to the huge number of unique names. The state feature was ignored because of the 62 unique identifiers. In an effort to improve accuracies in the model we could consider using these two additional features. The company feature would need to be tokenized. Using the term frequency for each company to the product might increase predictive capabilities. Since it is not recommended to use more than 52 unique values for one-hot encoding. The state feature could be limited to the most frequent values which would then limit the number of complaint narratives and companies in the initial data set.
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(warning = FALSE)
library(tidyverse)
library(tidytext)
library(tidyr)
library(tidymodels)
library(textrecipes)
library(parsnip)
library(skimr)
set.seed(42)
train <- read_csv("./data_complaints_train.csv") %>%
select("Product","Consumer complaint narrative","State","Company") %>%
setNames(c("product","complaint","state","company")) %>%
mutate_if(is.character,as.factor) %>%
na.omit()
dim_train <- dim(train)
test <- read_csv("./data_complaints_test.csv")%>%
select("Consumer complaint narrative","State") %>%
setNames(c("complaint","state")) %>%
mutate_if(is.character,as.factor)
train$product <- recode(train$product,
`Credit card or prepaid card` = "credit_card",
`Mortgage`="mortgage",
`Student loan` = "student_loan",
`Vehicle loan or lease`="vehicle")
knitr::kable(skim(train),
caption = "Table 1: Skim of untidy training data.")
product_train <- train %>%
mutate(ID = row_number()) # add ID column
product_train <- product_train %>%
select(product,complaint,ID) %>%
group_by(product,ID) %>%
unnest_tokens(word,complaint) %>%
anti_join(stop_words) %>%
filter(!str_detect(word,"\\d")) %>%
filter(str_detect(word,"[a-z]")) %>%
filter(!str_detect(word,"X{1,}")) %>%
filter(!str_detect(word,"x{1,}")) %>%
summarize(complaint = str_c(word, collapse = " ")) %>%
ungroup()
# complaint tf_idf
complaint_unigram <- product_train %>%
unnest_tokens(word,complaint) %>%
count(product,word)
total_count <- complaint_unigram %>%
group_by(product) %>%
summarise(word_total = sum(n))
complaint_words <- left_join(complaint_unigram,total_count)
complaint_tf_idf <- complaint_words %>%
bind_tf_idf(word,product,n)
company_tf_idf <- complaint_tf_idf %>%
arrange(desc(tf_idf))
# tf_idf_plot
complaint_tf_idf %>%
group_by(product) %>%
slice_max(tf_idf,n=10) %>%
ungroup() %>%
ggplot(aes(tf_idf,fct_reorder(word,tf_idf),fill=product)) +
geom_col(show.legend = FALSE) +
facet_wrap(~product, ncol=2,scales="free") +
labs(x="tf_idf",y=NULL,
title="Term Frequency_Inverse Document Frequency")
# split for tuning later ----------
text_split <- initial_split(train, strata = "product")
df_train <- training(text_split)
df_valid <- testing(text_split)
dim_df_train <- dim(df_train)
dim_df_valid <- dim(df_valid)
full_prop <- product_train %>%
select(product) %>%
count(product) %>%
mutate(train_prop = round(n/sum(n),4)) %>%
rename(train_n = n)
train_prop <- df_train %>%
count(product) %>%
mutate(sub_train_prop = round(n/sum(n),4)) %>%
select(n,sub_train_prop) %>%
rename(sub_train_n = n)
valid_prop <- df_valid %>%
count(product) %>%
mutate(valid_prop = round(n/sum(n),4)) %>%
select(n, valid_prop) %>%
rename(valid_n=n)
props <- full_prop %>%
bind_cols(train_prop,valid_prop)
knitr::kable(props, caption = "Table 2: Counts and Proportions")
my_stop_words <- c("XX","XXX","XXXX","xx","xxx","xxxx")
product_recipe <- recipe(product ~ complaint, data = df_train ) %>%
step_tokenize(complaint) %>%
step_stopwords(complaint) %>%
step_stopwords(complaint, custom_stopword_source = my_stop_words) %>%
step_tokenfilter(complaint, max_tokens = 50) %>%
step_tfidf(complaint)
## random forest
rf_model <- rand_forest(mtry = 3, min_n = 10, trees = 100) %>%
set_engine("ranger") %>%
set_mode("classification")
## random_forest
rf_workflow <- workflow() %>%
add_recipe(product_recipe) %>%
add_model(rf_model)
fit_rf <- fit(rf_workflow,df_train)
## random_forest
rf_pred <- predict(fit_rf, df_train) %>%
bind_cols(predict(fit_rf, df_train, type = "prob")) %>%
bind_cols(df_train %>% select(product)) %>%
rename(credit_card = ".pred_credit_card",
mortgage = ".pred_mortgage",
student_loan = ".pred_student_loan",
vehicle = ".pred_vehicle")
## random_forest
rf_acc <- accuracy(rf_pred, product, .pred_class)
multi_roc_auc <- roc_auc(rf_pred,
product,credit_card,
mortgage,student_loan,vehicle)
rf_valid_pred <- predict(fit_rf, new_data = df_valid) %>%
bind_cols(predict(fit_rf,new_data = df_valid,type="prob")) %>%
bind_cols(df_valid %>% select(product))
rf_valid_acc <- accuracy(rf_valid_pred, product, .pred_class)
acc_tibble <- tibble(union(rf_acc,rf_valid_acc))
subset <- tibble(set = c("train","valid"))
acc_tibble <- bind_cols(subset,acc_tibble)
knitr::kable(acc_tibble, caption = "Table 3: Initial Accuracies")
rf_model <- rand_forest(mtry = tune(), min_n = tune(), trees = 100) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_workflow <- workflow() %>%
add_recipe(product_recipe,
blueprint =
hardhat::default_recipe_blueprint(allow_novel_levels =
TRUE)) %>%
add_model(rf_model)
set.seed(234)
folds <- vfold_cv(df_train, strata = "product")
set.seed(543)
rf_resample <- rf_workflow %>%
tune_grid(df_valid,
grid = 5,
resamples = folds,
control = control_grid(save_pred = TRUE),
metrics = metric_set(roc_auc))
# final model + test prediction ---------------------
rf_model <- rand_forest(mtry = 7, min_n = 21, trees = 100) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_workflow <- workflow() %>%
add_recipe(product_recipe,
blueprint =
hardhat::default_recipe_blueprint(allow_novel_levels =
TRUE)) %>%
add_model(rf_model)
fit_rf <- fit(rf_workflow,df_train)
# final prediction -----------------
knitr::kable(predict(fit_rf, test), caption = "Table 4: Final Class Predictions")