Predicting Loans Using Tree-Based Models

A Comprehensive Approach to Applying Tree Models

Optimizing Decision Making Using Cross-Validation and Hyperparameter Tuning Methods on a Loans Dataset

Author

Aouidane Imed Eddine

Published

October 30, 2024

Abstract

This project explores tree-based models applied to a loans dataset. The analysis uses random grid search with cross-validation to optimize model performance.

Keywords

Decision Trees, Machine Learning, Model Tuning, Loans Dataset, Boosting trees

library(tidyverse)
library(tidymodels)
library(DT)
library(corrplot)
library(baguette)

1 Introduction to the Loans Dataset

The dataset provides historical information on loan applicants and their corresponding default status. Here, we’ll load the dataset and present a first look at its structure and variables.

loans <- read_rds("D:\\dataset\\loan_df.rds")
datatable(loans,rownames = FALSE,caption = "Loans Dataset",options = list(
    searching = FALSE,
    ordering = TRUE))

We use an interactive table to display the loans dataset, allowing for a clearer view of the loan data. This view will serve as the base for exploratory and feature engineering tasks.

glimpse(loans)

Rows: 872
Columns: 8
$ loan_default        <fct> no, yes, no, no, yes, yes, no, no, no, no, no, yes…
$ loan_purpose        <fct> debt_consolidation, medical, small_business, small…
$ missed_payment_2_yr <fct> no, no, no, no, yes, no, no, no, yes, no, no, no, …
$ loan_amount         <int> 25000, 10000, 13000, 36000, 12000, 13000, 10000, 4…
$ interest_rate       <dbl> 5.47, 10.25, 6.22, 5.97, 11.75, 13.25, 10.47, 7.97…
$ installment         <dbl> 855.42, 363.79, 441.73, 1152.01, 307.67, 333.31, 3…
$ annual_income       <dbl> 62823, 40000, 65000, 125000, 65000, 87000, 120000,…
$ debt_to_income      <dbl> 39.39, 24.06, 13.96, 8.09, 20.13, 18.41, 12.79, 20…

2 Statistical Summary of the Dataset

A statistical summary helps in understanding the basic distributions and ranges of numeric variables in the dataset. This summary includes important measures like mean, standard deviation, and range for each feature.

statistical_summary <- psych::describe(loans)
datatable(statistical_summary,caption = "Loans Dataset Statistical summary",options = list(
    searching = FALSE,
    ordering = TRUE))

3 Correlation Analysis

To investigate relationships among variables, we generate a correlation plot, which helps identify highly correlated features that may be redundant for the model.

correlation <- cor(loans[,c(4:7)],method = "pearson")
corrplot(correlation,method = "number")

We can observe a high correlation between the interest rate and installment variables, which can lead to biased results. This is a problem we have to deal with before modeling.
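To make the redundancy check concrete, here is a minimal base R sketch that lists variable pairs whose absolute Pearson correlation exceeds a cutoff. The helper name `high_cor_pairs`, the toy data frame, and the 0.9 cutoff are illustrative assumptions and are not taken from the loans data.

```r
# Hypothetical helper: return variable pairs with |correlation| above a cutoff.
high_cor_pairs <- function(data, cutoff = 0.9) {
  cm <- cor(data, method = "pearson")
  # Keep only the upper triangle so each pair is reported once.
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Toy data (made up): b is almost a linear function of a, c is independent noise.
set.seed(1)
toy <- data.frame(a = rnorm(200))
toy$b <- 2 * toy$a + rnorm(200, sd = 0.1)
toy$c <- rnorm(200)

pairs_found <- high_cor_pairs(toy)
print(pairs_found)
```

On the toy data only the pair (a, b) is flagged. One member of each flagged pair is a candidate for removal.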

4 Data Splitting

To ensure a fair evaluation of the model, we split the dataset into training and testing sets. We stratify on loan_default to maintain the same proportion of defaults across both sets.

loans_split <- initial_split(loans,strata = loan_default,prop = .8)
loans_training <- training(loans_split)
loans_test <- testing(loans_split)
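As a rough illustration of what stratification achieves, the base R sketch below samples 80% of the rows within each class of a made-up outcome vector and compares the class proportions in the two resulting sets. The outcome vector and its class balance are assumptions, and `initial_split()` may implement the sampling differently.

```r
# Made-up binary outcome (illustrative only, not the loans data).
set.seed(42)
outcome <- factor(sample(c("yes", "no"), size = 1000, replace = TRUE,
                         prob = c(0.37, 0.63)))

# Stratified 80/20 split: sample 80% of the row indices within each class.
train_idx <- unlist(lapply(split(seq_along(outcome), outcome), function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))
train <- outcome[train_idx]
test  <- outcome[-train_idx]

# Class proportions should be nearly identical in both sets.
train_prop <- prop.table(table(train))
test_prop  <- prop.table(table(test))
print(round(train_prop, 3))
print(round(test_prop, 3))
```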

5 Feature Engineering and Data Preprocessing

Feature engineering is a critical step that prepares the data for model training. Here, we set up a recipe that performs correlation filtering, dummy encoding for categorical variables, and normalization.

Building a recipe

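loans_recipe <- recipe(loan_default ~ ., data = loans_training) %>% 
  # drop numeric predictors that are highly correlated with another predictor
  step_corr(all_numeric_predictors()) %>% 
  # dummy-encode the categorical predictors (the outcome is left untouched)
  step_dummy(all_nominal_predictors()) %>% 
  # center and scale all numeric predictors, including the new dummy columns
  step_normalize(all_numeric_predictors())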

5.1 Creating Cross-Validation Folds

loans_cv <- vfold_cv(data = loans_training,v = 5)
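For intuition about what the five folds look like, here is a small base R sketch of a random fold assignment. It is not how `vfold_cv()` is implemented. The value `n = 697` assumes the training set is the 872 rows minus the 175 test rows counted in the confusion matrix later in the document.

```r
# Assign each of n rows to one of v folds at random, with near-equal fold sizes.
set.seed(123)
n <- 697   # assumed number of training rows (872 total minus 175 test rows)
v <- 5
fold_id <- sample(rep(seq_len(v), length.out = n))

# For fold k, rows with fold_id == k are held out; the rest are used for fitting.
for (k in seq_len(v)) {
  assessment <- which(fold_id == k)
  analysis   <- which(fold_id != k)
  cat("Fold", k, ": fit on", length(analysis),
      "rows, assess on", length(assessment), "rows\n")
}
```

Every row is held out exactly once, so each candidate model is assessed on all training rows without ever being scored on data it was fit on.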

6 Model Selection and Comparison

In this project, we evaluate four tree-based models: Decision Trees, Bagged Trees, Random Forests, and Boosted Trees. Each model has unique characteristics and approaches to handling data, making them suitable for different scenarios. Below is a brief overview of each:

6.1 Decision Tree

A basic Decision Tree is a flowchart-like structure where internal nodes represent tests on a feature, branches represent the outcomes, and leaf nodes represent the final predictions. It is easy to interpret but can overfit on complex data.

dtree_model <- decision_tree(tree_depth = tune(),min_n = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("rpart")

6.2 Bagged Trees

Bagged Trees involve creating multiple decision trees using bootstrapped samples of the data and averaging their predictions. This method helps reduce variance and mitigates overfitting.

Note

This is the slowest model to tune, because every candidate fits an ensemble of 100 trees (times = 100) on each resample.

bagged_tree <- bag_tree(tree_depth = tune(),min_n = tune(),cost_complexity = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("rpart", times = 100)

6.3 Random Forest

Random Forests further enhance bagging by adding randomness in feature selection at each split. This helps in reducing the correlation between trees, making it more robust.

random_forest <- rand_forest(trees = tune(),min_n = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("ranger")

6.4 Boosted Trees

Boosted Trees are an ensemble technique where each subsequent tree focuses on the residuals of the previous trees, improving areas where errors were made. This model is known for high accuracy on complex data.

boosted_tree <- boost_tree(tree_depth = tune(),trees = tune(),learn_rate = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("xgboost")
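The idea of fitting each new tree to the residuals of the previous ones can be sketched in a few lines. The example below is a simplified regression version with squared error, using `rpart` stumps on made-up data. The learning rate, number of trees, and data are assumptions, and xgboost's classification algorithm differs in its loss function and regularization.

```r
library(rpart)

# Made-up regression data (illustrative only).
set.seed(7)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.2)
dat <- data.frame(x = x, y = y)

learn_rate <- 0.1   # assumed shrinkage applied to each tree's contribution
n_trees    <- 100   # assumed number of boosting rounds

# Start from the mean, then repeatedly fit a stump to the current residuals.
pred <- rep(mean(y), length(y))
for (i in seq_len(n_trees)) {
  dat$resid <- y - pred
  stump <- rpart(resid ~ x, data = dat,
                 control = rpart.control(maxdepth = 1, cp = 0))
  pred <- pred + learn_rate * predict(stump, dat)
}

mse_start <- mean((y - mean(y))^2)   # error of the constant starting model
mse_end   <- mean((y - pred)^2)      # training error after boosting
cat("Training MSE before:", round(mse_start, 3), "after:", round(mse_end, 3), "\n")
```

A smaller `learn_rate` generally needs more trees to reach the same fit, which is why `trees` and `learn_rate` are tuned together in the specification above.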

6.5 Hyperparameter Tuning Grids

We tune the hyperparameters of each model with a random grid: grid_random() draws 15 candidate parameter combinations per model, and cross-validation is later used to select the best-performing set.

# extract_parameter_set_dials() replaces the deprecated parameters()
dtree_parameters_grid <- grid_random(extract_parameter_set_dials(dtree_model),size = 15)
bagged_tree_params <- grid_random(extract_parameter_set_dials(bagged_tree),size = 15)
random_forest_params <- grid_random(extract_parameter_set_dials(random_forest),size = 15)
boosted_tree_params <- grid_random(extract_parameter_set_dials(boosted_tree),size = 15)
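To show what such a grid contains, here is a base R sketch that draws 15 random candidates for two decision-tree hyperparameters. The ranges are assumptions chosen for illustration. The real grids take their ranges from the dials parameter objects.

```r
# Draw 15 random candidates for two decision-tree hyperparameters.
# The ranges below are assumed for illustration only.
set.seed(2024)
size <- 15
random_grid <- data.frame(
  tree_depth = sample(1:15, size, replace = TRUE),
  min_n      = sample(2:40, size, replace = TRUE)
)

# Duplicate combinations add no information, so drop them.
random_grid <- unique(random_grid)
print(head(random_grid))
```

A regular grid would instead cross a fixed set of levels for every parameter, for example with `expand.grid()`, which grows quickly as more parameters are tuned.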

7 Workflow and Model Training

Each model is defined within a workflow, combining the recipe with the model specification. The workflow enables efficient training and validation.

7.1 Decision Tree Workflow

dtree_workflow <- workflow() %>% 
  add_recipe(loans_recipe) %>% 
  add_model(dtree_model)

7.2 Bagged Tree Workflow

bagged_tree_workflow <- workflow() %>% 
  add_recipe(loans_recipe) %>% 
  add_model(bagged_tree)

7.3 Random Forest Workflow

randomforest_workflow <- workflow() %>% 
  add_recipe(loans_recipe) %>% 
  add_model(random_forest)

7.4 Boosted Tree Workflow

boosted_tree_workflow <- workflow() %>% 
  add_recipe(loans_recipe) %>% 
  add_model(boosted_tree)

8 Model Fitting with Cross-Validation

Each model is fit using cross-validation to assess its performance across different hyperparameter configurations.

8.1 Decision Tree

dtree_fit <- tune_grid(dtree_workflow,grid = dtree_parameters_grid,resamples = loans_cv,metrics = metric_set(roc_auc,accuracy))

8.2 Bagged Tree

bagged_tree_fit <- tune_grid(bagged_tree_workflow,grid = bagged_tree_params,resamples = loans_cv)

8.3 Random Forest

random_forest_fit <- tune_grid(randomforest_workflow,grid = random_forest_params,resamples = loans_cv)

Warning: package 'ranger' was built under R version 4.3.3

9 Boosted Tree

boosted_tree_fit <- tune_grid(boosted_tree_workflow,grid = boosted_tree_params,resamples = loans_cv)

Warning: package 'xgboost' was built under R version 4.3.3

10 Model Comparison and Evaluation

Each model’s performance is summarized by the minimum, maximum, mean, and median of accuracy and ROC AUC, computed over all folds and hyperparameter candidates. This allows us to understand the variability and reliability of each model.

rand_forest_results <- random_forest_fit %>% 
  collect_metrics(summarize = FALSE) %>% 
  group_by(.metric) %>% 
  summarise(min = min(.estimate),max = max(.estimate),mean = mean(.estimate),median = median(.estimate)) %>% 
  mutate(across(where(is.numeric), ~ round(., 3)))

datatable(rand_forest_results,rownames = FALSE,options = list(
    searching = FALSE,
    ordering = FALSE))

decision_tree_results <- dtree_fit %>% 
  collect_metrics(summarize = FALSE) %>% 
  group_by(.metric) %>% 
  summarise(min = min(.estimate),max = max(.estimate),mean = mean(.estimate),median = median(.estimate)) %>% 
  mutate(across(where(is.numeric), ~ round(., 3)))

datatable(decision_tree_results,rownames = FALSE,options = list(
    searching = FALSE,
    ordering = FALSE))

bagged_trees_results <- bagged_tree_fit %>% 
  collect_metrics(summarize = FALSE) %>% 
  group_by(.metric) %>% 
  summarise(min = min(.estimate),max = max(.estimate),mean = mean(.estimate),median = median(.estimate)) %>% 
  mutate(across(where(is.numeric), ~ round(., 3)))
  
datatable(bagged_trees_results,rownames = FALSE,options = list(
    searching = FALSE,
    ordering = FALSE))

boosted_trees_results <- boosted_tree_fit %>% 
  collect_metrics(summarize = FALSE) %>% 
  group_by(.metric) %>% 
  summarise(min = min(.estimate),max = max(.estimate),mean = mean(.estimate),median = median(.estimate)) %>% 
  mutate(across(where(is.numeric), ~ round(., 3)))
  
datatable(boosted_trees_results,rownames = FALSE,options = list(
    searching = FALSE,
    ordering = FALSE))

Note

The best model was the boosted tree, which achieved the highest roc_auc among all models.

11 Selecting and Finalizing the Best Model

From the comparison, we select the boosted tree model as it achieves the highest ROC AUC score. We then finalize this model using the best parameters.

best_model <- select_best(boosted_tree_fit,metric = "roc_auc")

best_model

# A tibble: 1 × 4
  trees tree_depth learn_rate .config              
  <int>      <int>      <dbl> <chr>                
1  1678          1     0.0462 Preprocessor1_Model02

12 Final Model Evaluation

After selecting the best model, we finalize the workflow with the chosen parameters and evaluate it on the test set using accuracy, ROC AUC, sensitivity, specificity, and the ROC curve.

last_workflow <- boosted_tree_workflow %>% 
  finalize_workflow(best_model)

13 Model Last Fit

last_model <- last_workflow %>% 
  last_fit(split = loans_split,metrics = metric_set(roc_auc,accuracy))
last_model %>% 
  collect_metrics()

# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.834 Preprocessor1_Model1
2 roc_auc  binary         0.895 Preprocessor1_Model1

14 Confusion Matrix

last_model %>% 
  collect_predictions() %>% 
  conf_mat(truth = loan_default,estimate = .pred_class)

          Truth
Prediction yes no
       yes  49 10
       no   19 97

custom_metrics <- metric_set(sens,spec)
predictions <- last_model %>% collect_predictions()
custom_metrics(predictions, truth = loan_default,estimate = .pred_class)

# A tibble: 2 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 sens    binary         0.721
2 spec    binary         0.907
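As a check on the reported values, sensitivity, specificity, and accuracy can be recomputed by hand from the confusion matrix counts. The sketch below uses base R only and treats "yes" as the positive class, which matches the reported figures.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = truth).
cm <- matrix(c(49, 19, 10, 97), nrow = 2,
             dimnames = list(Prediction = c("yes", "no"), Truth = c("yes", "no")))

tp <- cm["yes", "yes"]; fn <- cm["no", "yes"]
fp <- cm["yes", "no"];  tn <- cm["no", "no"]

sensitivity <- tp / (tp + fn)        # share of actual defaults that were caught
specificity <- tn / (tn + fp)        # share of non-defaults correctly identified
accuracy    <- (tp + tn) / sum(cm)   # share of all predictions that were correct

print(round(c(sens = sensitivity, spec = specificity, accuracy = accuracy), 3))
# sens 0.721, spec 0.907, accuracy 0.834
```

The model misses 19 of the 68 actual defaults in the test set, so the lower sensitivity is the figure to watch if missed defaults are the costlier error.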

15 ROC Curve

last_model %>% 
  collect_predictions() %>% 
  roc_curve(truth = loan_default,.pred_yes) %>% 
  autoplot()

16 Feature Importance

vip::vip(last_model$.workflow[[1]])

We can observe the most important feature is interest rate.

17 Conclusion

The analysis concludes that the boosted tree model with the selected hyperparameters is the most suitable of the four candidates for predicting loan defaults in this dataset. On the test set it reaches an accuracy of 0.834 and a ROC AUC of 0.895, with a sensitivity of 0.721 and a specificity of 0.907.