In this post, I will be modelling data from the Consumer Complaint Database. My modelling goal is to predict the product (credit card, student loan, mortgage or vehicle lease/loan) based on the consumer complaint narrative. Because there are more than two products, I will be using a random forest multiclass classification model.

Let’s load the data.

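library(tidymodels)   # rsample, recipes, parsnip, workflows, tune, yardstick (also attaches dplyr and ggplot2)
library(textrecipes)  # step_tokenize, step_stopwords, step_tokenfilter, step_tfidf
library(themis)       # step_smote

data_complaints_test <- read.csv("aEBWUxehSGyAVlMXoThsoQ_edf53641edca416fa00a78d9e4b16ced_data_complaints_test.csv")
data_complaints_train <- read.csv("JhHJz2SSRCqRyc9kkgQqxA_8d34147955154de4a6176086946d07b3_data_complaints_train.csv")

#keep the outcome (as a factor) and the complaint text
data_complaints_train <- data_complaints_train %>% 
  mutate(Product = as.factor(Product)) %>% 
  select(Product, Consumer.complaint.narrative)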

Step 1: Splitting the data

The first thing I did was split the data into training and testing data. I stratified the split by product to make sure there’s good representation for each product. I then wanted to see the count for each product, to see if I should make changes in my recipe.

#set split
set.seed(1234)
split_data<-initial_split(data_complaints_train,prop = 2/3, strata=Product)
  
training_data<-training(split_data)
testing_data<-testing(split_data)
  
#check number of cases for each product
training_data %>% 
  count(Product)
##                       Product     n
## 1 Credit card or prepaid card 25529
## 2                    Mortgage 20638
## 3                Student loan  8323
## 4       Vehicle loan or lease  6159
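
To see what stratifying the split by product does, here is a minimal base R sketch on made-up data. The `toy` data frame and its counts are hypothetical, and rsample's own implementation may differ in detail; the sketch only shows the idea of sampling 2/3 of the rows within each product rather than from the data as a whole.

```r
# Toy data (made up): an imbalanced two-product data set
set.seed(1234)
toy <- data.frame(
  id = 1:360,
  Product = rep(c("Mortgage", "Vehicle loan or lease"), times = c(300, 60))
)

# Sample 2/3 of the row indices separately within each product
rows_by_product <- split(seq_len(nrow(toy)), toy$Product)
train_idx <- unlist(lapply(rows_by_product, function(idx) {
  sample(idx, size = round(length(idx) * 2 / 3))
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both pieces keep the original 5:1 ratio between the two products
table(toy_train$Product)
table(toy_test$Product)
```

Because the sampling happens inside each product, the rarer product cannot end up missing or badly over-represented in either piece by chance.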

Because there are far more credit card and mortgage complaints than student loan or vehicle loan complaints, I will upsample using the step_smote function in the recipe to balance the products.
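
To put numbers on the imbalance, the counts printed above can be turned into proportions. This is a small sketch using only those four counts; the second calculation assumes that balancing means bringing every product up to the size of the largest one.

```r
# Training counts copied from the count(Product) output above
counts <- c(
  "Credit card or prepaid card" = 25529,
  "Mortgage"                    = 20638,
  "Student loan"                = 8323,
  "Vehicle loan or lease"       = 6159
)

# Share of the training data held by each product
round(counts / sum(counts), 3)

# Rows each product would need to gain to match the largest class
# (assumes every class is balanced up to the majority count)
max(counts) - counts
```

Credit cards make up roughly 42% of the training rows and vehicle loans/leases roughly 10%, so the largest class is about four times the size of the smallest.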

Step 2: Preprocessing

Next, I preprocessed the data using a recipe.

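#Create recipe
comp_recipe <- recipe(Product ~ Consumer.complaint.narrative, data = training_data) %>% 
  step_tokenize(Consumer.complaint.narrative) %>%     # split each complaint into words
  step_stopwords(Consumer.complaint.narrative) %>%    # drop stop words
  step_tokenfilter(Consumer.complaint.narrative) %>%  # keep the top 100 tokens (default max_tokens)
  step_tfidf(Consumer.complaint.narrative) %>%        # one tf-idf column per remaining token
  # no column is named here; step_smote(Product) would select the outcome to balance
  step_smote()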

Here’s a quick explanation of each step:

step_tokenize separates each complaint into individual words (tokens).

step_stopwords removes words that don’t carry much meaning and have little predictive impact (a, for, the, etc.).

step_tokenfilter keeps only the 100 most frequent words, which is its default.

step_tfidf turns each remaining word into its own column and weights it by how often it appears in a complaint relative to how common it is across all complaints.

step_smote is meant to upsample the products that had a low representation in the dataset (student loan and vehicle loan).
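
For intuition about the tf-idf weighting, here is a hand-rolled base R sketch on three made-up complaints. It uses the plain textbook definition (term frequency times log of the number of documents over the number of documents containing the word). step_tfidf has its own smoothing and normalisation options, so its numbers will not match these exactly; the `docs` vector and all variable names are hypothetical.

```r
# Toy complaints (made up for illustration)
docs <- c(
  "my mortgage payment was lost",
  "my credit card payment was declined",
  "student loan servicer lost my payment"
)

# Tokenize: lower-case and split on anything that is not a letter
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))

# Count matrix: one row per complaint, one column per word
counts <- t(vapply(
  tokens,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(counts) <- vocab

tf    <- counts / rowSums(counts)          # term frequency within each complaint
df    <- colSums(counts > 0)               # number of complaints containing the word
idf   <- log(length(docs) / df)            # rarer words get a larger weight
tfidf <- sweep(tf, 2, idf, `*`)

round(tfidf[, c("payment", "mortgage")], 3)
```

A word such as "payment" that appears in every complaint gets a weight of zero, while "mortgage", which appears in only one, gets a positive weight in that complaint. That is why tf-idf columns tend to be more useful for telling products apart than raw word counts.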

Step 3: Creating a multiclass classification model

I used a random forest classification model with 10 trees, using the ranger engine, for multiclass classification. I will tune the mtry parameter to determine which value gives the best accuracy.
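
As a reminder of what mtry controls: at each split, the forest only considers a random subset of the predictors, and mtry is the size of that subset. The sketch below assumes the 100 tf-idf predictors described above and uses the common floor(sqrt(p)) rule of thumb for classification forests as a reference point; the variable names are illustrative only.

```r
# Number of tf-idf predictors produced by the recipe (top 100 words)
n_predictors <- 100

# A common default for classification forests: floor(sqrt(p))
default_mtry <- floor(sqrt(n_predictors))
default_mtry

# What mtry means at a single split: only a random subset of the
# predictor columns is considered as split candidates
set.seed(1)
candidates <- sample(seq_len(n_predictors), size = default_mtry)
length(candidates)
```

Tuning mtry checks whether a smaller or larger subset than this rule of thumb works better for this particular data.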

comp_RF_model <- rand_forest(mtry = tune(),trees = 10) %>% 
  set_mode("classification") %>% 
  set_engine("ranger", importance="impurity")

Step 4: Creating a workflow

comp_wf <- workflow() %>% 
  add_recipe(comp_recipe) %>% 
  add_model(comp_RF_model)

Step 5: Evaluate the best mtry value using vfold cross validation for resampling

vfold_comp <-vfold_cv(data = training_data,v=10)

tune_RF_results <-tune_grid(object = comp_wf,resamples = vfold_comp,grid = 10, metrics=metric_set(accuracy))
tune_RF_results %>% 
  collect_metrics() 
## # A tibble: 10 × 7
##     mtry .metric  .estimator  mean     n std_err .config     
##    <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>       
##  1    15 accuracy multiclass 0.860    10 0.00397 Preprocesso…
##  2     7 accuracy multiclass 0.855    10 0.00373 Preprocesso…
##  3    40 accuracy multiclass 0.861    10 0.00356 Preprocesso…
##  4    65 accuracy multiclass 0.858    10 0.00339 Preprocesso…
##  5    90 accuracy multiclass 0.858    10 0.00361 Preprocesso…
##  6    51 accuracy multiclass 0.861    10 0.00394 Preprocesso…
##  7    44 accuracy multiclass 0.860    10 0.00379 Preprocesso…
##  8    27 accuracy multiclass 0.861    10 0.00336 Preprocesso…
##  9    78 accuracy multiclass 0.859    10 0.00372 Preprocesso…
## 10    97 accuracy multiclass 0.859    10 0.00346 Preprocesso…
show_best(tune_RF_results,metric="accuracy", n=1)
## # A tibble: 1 × 7
##    mtry .metric  .estimator  mean     n std_err .config      
##   <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>        
## 1    27 accuracy multiclass 0.861    10 0.00336 Preprocessor…

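After collecting the metrics, we can see that an mtry value of 27 gave the highest mean accuracy (0.861) and the lowest standard error (0.00336). The differences are small, though: all ten candidates scored between 0.855 and 0.861.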
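
To gauge how much the choice of mtry matters, one option is to check which candidates fall within one standard error of the best one. The sketch below retypes the rounded values from the table above into a data frame, so it is only as precise as the printed output; the one-standard-error check is an extra illustration and not part of the original analysis.

```r
# Tuning results copied (rounded) from the collect_metrics() output above
results <- data.frame(
  mtry    = c(15, 7, 40, 65, 90, 51, 44, 27, 78, 97),
  mean    = c(0.860, 0.855, 0.861, 0.858, 0.858, 0.861, 0.860, 0.861, 0.859, 0.859),
  std_err = c(0.00397, 0.00373, 0.00356, 0.00339, 0.00361,
              0.00394, 0.00379, 0.00336, 0.00372, 0.00346)
)

# Best candidate as reported by show_best()
best <- results[results$mtry == 27, ]

# Candidates whose mean accuracy is within one standard error of the best
threshold     <- best$mean - best$std_err
within_one_se <- results[results$mean >= threshold, ]
sort(within_one_se$mtry)
```

Every candidate except mtry = 7 passes this check, which suggests the model is not very sensitive to mtry over this range.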

Step 6: Update the model using an mtry value of 27 and increase the number of trees to 500.

#create final RF multiclass model with mtry of 27 and 500 trees
comp_RF_model <- rand_forest(mtry = 27,trees = 500) %>% 
  set_mode("classification") %>% 
  set_engine("ranger", importance="impurity")

#create workflow
comp_wf_final <- workflow() %>% 
  add_recipe(comp_recipe) %>% 
  add_model(comp_RF_model)

#fit model
comp_wf_final_fit<-fit(comp_wf_final,training_data)
comp_wf_final_fit
## ══ Workflow [trained] ═══════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ─────────────────────────────────────────────
## 5 Recipe Steps
## 
## • step_tokenize()
## • step_stopwords()
## • step_tokenfilter()
## • step_tfidf()
## • step_smote()
## 
## ── Model ────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~27,      x), num.trees = ~500, importance = ~"impurity", num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      60649 
## Number of independent variables:  100 
## Mtry:                             27 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1222746

Step 7: Evaluate the model on the testing data and create a confusion matrix

#predict product on test
pred_product <- predict(comp_wf_final_fit,new_data = testing_data)
accuracy(testing_data,truth = Product, estimate = pred_product$.pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.868

I used ggplot2’s geom_tile to draw the confusion matrix. I know there’s a quicker way to do this, but I still really like how this turned out.

#create confusion matrix
conf_mat_test<-bind_cols(testing_data,pred_product=pull(pred_product,.pred_class)) %>% 
  add_count(Product,pred_product) %>% 
  add_count(pred_product) %>% 
  select(pred_product, Product,n,nn) %>% 
  mutate(pred_perc=n/nn) %>%
  distinct() %>% 
  mutate(pred_perc=round(pred_perc,2))
## Storing counts in `nn`, as `n` already present in input
## ℹ Use `name = "new_name"` to pick a new name.
conf_mat_test %>% 
  ggplot(aes(x=pred_product, y=Product))+
  geom_tile(aes(fill=pred_perc))+
  geom_text(aes(label=pred_perc))+
  scale_fill_gradient(low="white",high="blue",name="prediction percentage")+
  xlab("predicted product")+
  ylab("product")+
  labs(title="Confusion matrix")

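[Figure: confusion matrix heatmap with predicted product on the x-axis, product on the y-axis, and tiles shaded by prediction percentage]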

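This matrix shows, for each predicted product, the share of those predictions that belong to each actual product, so the diagonal is the percentage of predictions for that product that were correct. Our model was good at correctly predicting credit cards (89%) and mortgages (93%), decent at predicting student loans (80%) and not great at predicting vehicle loans/leases (61%). This may be because vehicle loans/leases were under-represented in the data. The fitted model in Step 6 reports a sample size of 60649, the same as the number of training rows, so step_smote(), which was called without naming a column, does not seem to have added any synthetic rows. Naming the outcome, as in step_smote(Product), or downsampling credit cards and mortgages might have led to better results.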
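
In the plotting code above, pred_perc divides each cell by the total for its predicted product, which makes the diagonal a per-product precision. Dividing by the total for the actual product instead would give recall, and the two can tell different stories. Here is a small base R sketch of the difference on made-up labels; the `truth` and `pred` vectors are hypothetical and unrelated to the real results.

```r
# Toy labels (made up for illustration)
truth <- c(rep("Mortgage", 8), rep("Vehicle", 4))
pred  <- c(rep("Mortgage", 6), rep("Vehicle", 2),   # the 8 true mortgages
           "Mortgage", rep("Vehicle", 3))           # the 4 true vehicle loans

cm <- table(truth = truth, pred = pred)

# Normalise by predicted class (columns): diagonal = precision
precision <- diag(prop.table(cm, margin = 2))

# Normalise by actual class (rows): diagonal = recall
recall <- diag(prop.table(cm, margin = 1))

round(precision, 2)
round(recall, 2)
```

In this toy case, both products have the same recall (0.75), but the rarer product has much lower precision (0.6 versus 0.86) because a few misclassified mortgages are enough to dilute its predictions. A low precision for a small class can therefore reflect class imbalance as much as poor separation.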

Step 8: Predict the products in the evaluation dataset

#evaluation prediction
eval_pred<-predict(comp_wf_final_fit,data_complaints_test)