library(tidyverse)
library(tidymodels)
library(DT)
library(corrplot)
library(baguette)
Predicting Loans Using Tree-Based Models
A Comprehensive Approach to Applying Tree Models
This project applies tree-based models to a loans dataset. Hyperparameters are tuned with random grid search and cross-validation to optimize model performance.
Decision Trees, Machine Learning, Model Tuning, Loans Dataset, Boosted Trees
1 Introduction to the Loans Dataset
The dataset provides historical information on loan applicants and their corresponding default status. Here, we’ll load the dataset and present a first look at its structure and variables.
loans <- read_rds("D:\\dataset\\loan_df.rds")
datatable(loans, rownames = FALSE, caption = "Loans Dataset", options = list(
  searching = FALSE,
  ordering = TRUE))
We use an interactive table to display the loans dataset, allowing for a clearer view of the loan data. This view will serve as the base for exploratory and feature engineering tasks.
glimpse(loans)
Rows: 872
Columns: 8
$ loan_default <fct> no, yes, no, no, yes, yes, no, no, no, no, no, yes…
$ loan_purpose <fct> debt_consolidation, medical, small_business, small…
$ missed_payment_2_yr <fct> no, no, no, no, yes, no, no, no, yes, no, no, no, …
$ loan_amount <int> 25000, 10000, 13000, 36000, 12000, 13000, 10000, 4…
$ interest_rate <dbl> 5.47, 10.25, 6.22, 5.97, 11.75, 13.25, 10.47, 7.97…
$ installment <dbl> 855.42, 363.79, 441.73, 1152.01, 307.67, 333.31, 3…
$ annual_income <dbl> 62823, 40000, 65000, 125000, 65000, 87000, 120000,…
$ debt_to_income <dbl> 39.39, 24.06, 13.96, 8.09, 20.13, 18.41, 12.79, 20…
2 Statistical Summary of the Dataset
A statistical summary helps in understanding the basic distributions and ranges of numeric variables in the dataset. This summary includes important measures like mean, standard deviation, and range for each feature.
statistical_summary <- psych::describe(loans)
datatable(statistical_summary, caption = "Loans Dataset Statistical Summary", options = list(
  searching = FALSE,
  ordering = TRUE))
3 Correlation Analysis
To investigate relationships among variables, we generate a correlation plot, which helps identify highly correlated features that may be redundant for the model.
# Columns 4 to 7: loan_amount, interest_rate, installment, annual_income
correlation <- cor(loans[, 4:7], method = "pearson")
corrplot(correlation, method = "number")
- We can observe a high correlation between the interest rate and installment variables, which can lead to biased results. This is a problem we have to deal with; the preprocessing recipe handles it with step_corr().
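Rather than reading pairs off the plot, one option is to threshold the correlation matrix directly. The sketch below runs on synthetic columns (the `toy` data frame and its values are made up for illustration, not taken from the loans data) and uses a 0.9 cutoff, which matches the default threshold of `step_corr()`.

```r
# Synthetic stand-in for numeric loan columns (all values are invented).
set.seed(123)
toy <- data.frame(loan_amount = runif(200, 1000, 40000))
toy$installment   <- toy$loan_amount / 36 + rnorm(200, sd = 20)
toy$annual_income <- runif(200, 20000, 150000)

cor_mat <- cor(toy, method = "pearson")

# Look only at the upper triangle so each pair is reported once,
# and keep pairs whose absolute correlation exceeds the cutoff.
cutoff <- 0.9
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
high_pairs
```

On the real data, the same code applied to `correlation` would list whichever pairs exceed the chosen cutoff.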
4 Data Splitting
To ensure a fair evaluation of the model, we split the dataset into training and testing sets. We stratify on loan_default to maintain the same proportion of defaults across both sets.
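As a toy illustration of what stratification preserves, the sketch below splits a made-up data frame with a 40/60 class balance and compares the class proportions in both partitions. The `toy` data and the seed are assumptions for the example only.

```r
library(rsample)

# Made-up data: 40 defaults and 60 non-defaults.
set.seed(42)
toy <- data.frame(
  loan_default = factor(rep(c("yes", "no"), times = c(40, 60)),
                        levels = c("yes", "no")),
  x = rnorm(100)
)

# Stratified 80/20 split: sampling is done within each class.
toy_split <- initial_split(toy, prop = 0.8, strata = loan_default)

train_props <- prop.table(table(training(toy_split)$loan_default))
test_props  <- prop.table(table(testing(toy_split)$loan_default))
train_props
test_props
```

Both tables should show roughly the same share of "yes" cases as the full data, which is the property the stratified split is meant to guarantee.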
loans_split <- initial_split(loans, strata = loan_default, prop = .8)
loans_training <- training(loans_split)
loans_test <- testing(loans_split)
5 Feature Engineering and Data Preprocessing
Feature engineering is a critical step that prepares the data for model training. Here, we set up a recipe that performs normalization, correlation filtering, and dummy encoding for categorical variables.
Building a recipe
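To make the effect of these three steps concrete, here is a minimal sketch on a made-up eight-row data frame (the `toy` values are invented; only the column names echo the loans data). It applies the same kinds of steps, then inspects the processed result with `prep()` and `bake()`.

```r
library(recipes)

# Invented toy data; installment is an exact multiple of loan_amount,
# so the two columns are perfectly correlated.
toy <- data.frame(
  loan_default  = factor(c("yes", "no", "no", "yes", "no", "yes", "no", "no")),
  loan_purpose  = factor(c("car", "medical", "home", "car",
                           "medical", "home", "car", "medical")),
  loan_amount   = c(5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000),
  annual_income = c(60000, 45000, 80000, 30000, 75000, 50000, 40000, 90000)
)
toy$installment <- toy$loan_amount * 0.03

toy_recipe <- recipe(loan_default ~ ., data = toy) %>%
  step_corr(all_numeric_predictors()) %>%    # drops one of each highly correlated pair
  step_dummy(all_nominal_predictors()) %>%   # factor levels become 0/1 columns
  step_normalize(all_numeric_predictors())   # center and scale every numeric predictor

# Estimate the steps on the toy data, then return the processed data.
baked <- bake(prep(toy_recipe), new_data = NULL)
names(baked)
```

Only one of `loan_amount` and `installment` should survive the correlation filter, and `loan_purpose` is replaced by dummy columns such as `loan_purpose_medical`. Because normalization runs after dummy encoding, the dummy columns are scaled as well.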
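loans_recipe <- recipe(loan_default ~ ., data = loans_training) %>%
  step_corr(all_numeric_predictors()) %>%    # drop highly correlated numeric predictors
  step_dummy(all_nominal_predictors()) %>%   # dummy-encode the factor predictors
  step_normalize(all_numeric_predictors())   # center and scale the numeric predictors
5.1 Creating cross-validation folds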
loans_cv <- vfold_cv(data = loans_training, v = 5)
6 Model Selection and Comparison
In this project, we evaluate four tree-based models: Decision Trees, Bagged Trees, Random Forests, and Boosted Trees. Each model has unique characteristics and approaches to handling data, making them suitable for different scenarios. Below is a brief overview of each:
6.1 Decision Tree
A basic Decision Tree is a flowchart-like structure where internal nodes represent tests on a feature, branches represent the outcomes, and leaf nodes represent the final predictions. It is easy to interpret but can overfit on complex data.
dtree_model <- decision_tree(tree_depth = tune(), min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine("rpart")
6.2 Bagged Trees
Bagged Trees involve creating multiple decision trees using bootstrapped samples of the data and averaging their predictions. This method helps reduce variance and mitigates overfitting.
Note that this is the slowest model to tune: every candidate in every resample fits 100 bootstrapped trees (times = 100 in the engine settings).
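The bagging idea can be sketched by hand with `rpart`: fit one tree per bootstrap sample and take a majority vote. This is a simplified illustration on simulated data, not what `baguette` does internally; the `toy` data, the 25 trees, and the variable names are assumptions for the example.

```r
library(rpart)

# Simulated two-class data (invented for illustration).
set.seed(1)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0, "yes", "no"))

n_trees <- 25  # odd, so a two-class vote cannot tie

# One column of predicted classes per bootstrapped tree.
votes <- sapply(seq_len(n_trees), function(i) {
  boot <- toy[sample(n, replace = TRUE), ]            # bootstrap sample
  fit  <- rpart(y ~ x1 + x2, data = boot, method = "class")
  as.character(predict(fit, newdata = toy, type = "class"))
})

# Majority vote across trees for each observation.
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
bagged_acc  <- mean(bagged_pred == toy$y)
bagged_acc
```

The loop also shows where the cost comes from: the work grows linearly with the number of trees, and tuning repeats it for every candidate and every fold.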
bagged_tree <- bag_tree(tree_depth = tune(), min_n = tune(), cost_complexity = tune()) %>%
  set_mode("classification") %>%
  set_engine("rpart", times = 100)
6.3 Random Forest
Random Forests further enhance bagging by adding randomness in feature selection at each split. This helps in reducing the correlation between trees, making it more robust.
random_forest <- rand_forest(trees = tune(), min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine("ranger")
6.4 Boosted Trees
Boosted Trees are an ensemble technique where each subsequent tree focuses on the residuals of the previous trees, improving areas where errors were made. This model is known for high accuracy on complex data.
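The "fit the residuals" idea is easiest to see in a regression setting with squared error, so the sketch below uses that simplification rather than the classification loss xgboost optimizes. The simulated data, the stump depth, and the values of `learn_rate` and `n_trees` are all assumptions chosen for illustration.

```r
library(rpart)

# Simulated regression data (invented for illustration).
set.seed(2)
toy <- data.frame(x = seq(0, 10, length.out = 200))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.2)

learn_rate <- 0.1
n_trees    <- 100

pred <- rep(mean(toy$y), nrow(toy))  # start from a constant prediction
mse  <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  # Each new tree is trained on what the current ensemble still gets wrong.
  toy$resid <- toy$y - pred
  stump <- rpart(resid ~ x, data = toy,
                 control = rpart.control(maxdepth = 1, cp = 0))
  # Add a shrunken version of the new tree's prediction.
  pred   <- pred + learn_rate * predict(stump, newdata = toy)
  mse[m] <- mean((toy$y - pred)^2)
}

c(first = mse[1], last = mse[n_trees])
```

The training error falls as trees are added, and the learning rate controls how much each tree contributes, which is why `trees` and `learn_rate` are usually tuned together.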
boosted_tree <- boost_tree(tree_depth = tune(), trees = tune(), learn_rate = tune()) %>%
  set_mode("classification") %>%
  set_engine("xgboost")
6.5 Hyperparameter Tuning Grids
We use random grid search to tune the hyperparameters of each model. For each model, grid_random() draws 15 candidate parameter combinations, and cross-validation then lets us select the best-performing one.
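To see what such a grid looks like, a random grid can also be built directly from `dials` parameter objects. The ranges below are illustrative assumptions, not the defaults used in the analysis.

```r
library(dials)

set.seed(7)
# Draw up to 15 random combinations within explicitly chosen ranges.
toy_grid <- grid_random(
  tree_depth(range = c(1L, 8L)),
  min_n(range = c(2L, 30L)),
  size = 15
)
toy_grid
```

Each row is one candidate configuration that `tune_grid()` would evaluate on every resample.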
# extract_parameter_set_dials() replaces the deprecated parameters()
dtree_parameters_grid <- grid_random(extract_parameter_set_dials(dtree_model), size = 15)
bagged_tree_params <- grid_random(extract_parameter_set_dials(bagged_tree), size = 15)
random_forest_params <- grid_random(extract_parameter_set_dials(random_forest), size = 15)
boosted_tree_params <- grid_random(extract_parameter_set_dials(boosted_tree), size = 15)
7 Workflow and Model Training
Each model is defined within a workflow, combining the recipe with the model specification. The workflow enables efficient training and validation.
7.1 Decision Tree Workflow
dtree_workflow <- workflow() %>%
  add_recipe(loans_recipe) %>%
  add_model(dtree_model)
7.2 Bagged Tree Workflow
bagged_tree_workflow <- workflow() %>%
  add_recipe(loans_recipe) %>%
  add_model(bagged_tree)
7.3 Random Forest Workflow
randomforest_workflow <- workflow() %>%
  add_recipe(loans_recipe) %>%
  add_model(random_forest)
7.4 Boosted Tree Workflow
boosted_tree_workflow <- workflow() %>%
  add_recipe(loans_recipe) %>%
  add_model(boosted_tree)
8 Model Fitting with Cross-Validation
Each model is fit using cross-validation to assess its performance across different hyperparameter configurations.
8.1 Decision Tree
dtree_fit <- tune_grid(
  dtree_workflow,
  grid = dtree_parameters_grid,
  resamples = loans_cv,
  metrics = metric_set(roc_auc, accuracy))
8.2 Bagged Tree
bagged_tree_fit <- tune_grid(bagged_tree_workflow, grid = bagged_tree_params, resamples = loans_cv)
8.3 Random Forest
random_forest_fit <- tune_grid(randomforest_workflow, grid = random_forest_params, resamples = loans_cv)
Warning: package 'ranger' was built under R version 4.3.3
9 Boosted Tree
boosted_tree_fit <- tune_grid(boosted_tree_workflow, grid = boosted_tree_params, resamples = loans_cv)
Warning: package 'xgboost' was built under R version 4.3.3
10 Model Comparison and Evaluation
Each model’s performance is summarized, displaying metrics such as minimum, maximum, mean, and median for both accuracy and ROC AUC. This allows us to understand the variability and reliability of each model.
rand_forest_results <- random_forest_fit %>%
collect_metrics(summarize = FALSE) %>%
group_by(.metric) %>%
summarise(min = min(.estimate),max = max(.estimate),mean = mean(.estimate),median = median(.estimate)) %>%
mutate(across(where(is.numeric), ~ round(., 3)))
datatable(rand_forest_results, rownames = FALSE, options = list(
  searching = FALSE,
  ordering = FALSE))
decision_tree_results <- dtree_fit %>%
collect_metrics(summarize = FALSE) %>%
group_by(.metric) %>%
summarise(min = min(.estimate),max = max(.estimate),mean = mean(.estimate),median = median(.estimate)) %>%
mutate(across(where(is.numeric), ~ round(., 3)))
datatable(decision_tree_results, rownames = FALSE, options = list(
  searching = FALSE,
  ordering = FALSE))
bagged_trees_results <- bagged_tree_fit %>%
collect_metrics(summarize = FALSE) %>%
group_by(.metric) %>%
summarise(min = min(.estimate),max = max(.estimate),mean = mean(.estimate),median = median(.estimate)) %>%
mutate(across(where(is.numeric), ~ round(., 3)))
datatable(bagged_trees_results, rownames = FALSE, options = list(
  searching = FALSE,
  ordering = FALSE))
boosted_trees_results <- boosted_tree_fit %>%
collect_metrics(summarize = FALSE) %>%
group_by(.metric) %>%
summarise(min = min(.estimate),max = max(.estimate),mean = mean(.estimate),median = median(.estimate)) %>%
mutate(across(where(is.numeric), ~ round(., 3)))
datatable(boosted_trees_results, rownames = FALSE, options = list(
  searching = FALSE,
  ordering = FALSE))
The best model was boosted trees, with the highest ROC AUC among all models.
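The four summaries above repeat the same pipeline, so they could be produced by one small helper. The sketch below is one possible version: `summarise_metrics()` is a hypothetical name, and it is demonstrated on a made-up tibble with the same `.metric` and `.estimate` columns that `collect_metrics(summarize = FALSE)` returns.

```r
library(dplyr)

# Hypothetical helper: per-metric min, max, mean and median, rounded to 3 digits.
summarise_metrics <- function(metrics) {
  metrics %>%
    group_by(.metric) %>%
    summarise(
      min    = min(.estimate),
      max    = max(.estimate),
      mean   = mean(.estimate),
      median = median(.estimate),
      .groups = "drop"
    ) %>%
    mutate(across(where(is.numeric), ~ round(.x, 3)))
}

# Made-up per-resample estimates, for illustration only.
toy_metrics <- tibble(
  .metric   = rep(c("accuracy", "roc_auc"), each = 3),
  .estimate = c(0.80, 0.82, 0.84, 0.85, 0.90, 0.95)
)

toy_results <- summarise_metrics(toy_metrics)
toy_results
```

With such a helper, each model's table reduces to a call like `summarise_metrics(collect_metrics(boosted_tree_fit, summarize = FALSE))`.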
11 Selecting and Finalizing the Best Model
From the comparison, we select the boosted tree model as it achieves the highest ROC AUC score. We then finalize this model using the best parameters.
best_model <- select_best(boosted_tree_fit, metric = "roc_auc")
best_model
# A tibble: 1 × 4
  trees tree_depth learn_rate .config
  <int>      <int>      <dbl> <chr>
1  1678          1     0.0462 Preprocessor1_Model02
12 Final Model Evaluation
After selecting the best model, we evaluate its performance on the test set using metrics such as accuracy, sensitivity, specificity, and ROC curve.
last_workflow <- boosted_tree_workflow %>%
  finalize_workflow(best_model)
13 Model Last fit
last_model <- last_workflow %>%
  last_fit(split = loans_split, metrics = metric_set(roc_auc, accuracy))
last_model %>%
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config
  <chr>    <chr>          <dbl> <chr>
1 accuracy binary         0.834 Preprocessor1_Model1
2 roc_auc  binary         0.895 Preprocessor1_Model1
14 Confusion matrix
last_model %>%
  collect_predictions() %>%
  conf_mat(truth = loan_default, estimate = .pred_class)
          Truth
Prediction yes no
       yes  49 10
       no   19 97
custom_metrics <- metric_set(sens, spec)
predictions <- last_model %>% collect_predictions()
custom_metrics(predictions, truth = loan_default, estimate = .pred_class)
# A tibble: 2 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 sens    binary         0.721
2 spec    binary         0.907
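These values follow directly from the confusion matrix, with "yes" treated as the positive class. As a quick check, the counts reported above can be entered by hand and the three metrics recomputed in base R.

```r
# Counts copied from the confusion matrix above (rows: prediction, columns: truth).
cm <- matrix(
  c(49, 19,   # truth = yes: predicted yes, predicted no
    10, 97),  # truth = no:  predicted yes, predicted no
  nrow = 2,
  dimnames = list(Prediction = c("yes", "no"), Truth = c("yes", "no"))
)

sensitivity <- cm["yes", "yes"] / sum(cm[, "yes"])  # 49 / (49 + 19)
specificity <- cm["no", "no"]   / sum(cm[, "no"])   # 97 / (97 + 10)
accuracy    <- sum(diag(cm))    / sum(cm)           # (49 + 97) / 175

round(c(sens = sensitivity, spec = specificity, accuracy = accuracy), 3)
```

The model misses 19 of the 68 actual defaults, so the lower sensitivity is where most of the remaining error sits.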
15 ROC Curve
last_model %>%
  collect_predictions() %>%
  roc_curve(truth = loan_default, .pred_yes) %>%
  autoplot()
16 Feature importance
vip::vip(last_model$.workflow[[1]])
- We can observe that the most important feature is the interest rate.
17 Conclusion
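The analysis concludes that the boosted tree model with the selected hyperparameters is the most suitable of the four candidates for predicting loan defaults in this dataset. On the held-out test set it reaches a ROC AUC of 0.895 and an accuracy of 0.834, with a specificity of 0.907 and a sensitivity of 0.721, and the interest rate stands out as its most important predictor.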