Business Problem

The bank gives out loans to customers in the written agreement that over a period of time that money is then returned with interest, it is a major way how the bank makes money however this transaction can poses a major risk. Not everyone who takes a loan pays it back meaning a company may lose out on Hundreds of thousands of dollars if not millions. The bank Wishes that there was some way to indicate wether a potential candidate will pay back the loan or not, then using this metric they are able to avoid massive financial loss in the future

End Goal

The end goal or this project is that we are able to find the indicators through the data that tell us what correlates to a potential financial loss. We will use Data visualizations , Summary Statistics and build a Machine Learning Algorithim that will tell us what factos matter the most

Findings

For our data analysis portion, we focused on visually demonstrating the impact that specific values can have, and how different numbers can yield varied results. When looking at a summary table, the significance of the numbers might not be immediately apparent, so we generated visuals to help you see the weight of these figures. We uncovered several interesting findings, but arguably the most striking is the effect of interest rates themselves. We observed a clear visual distinction between those with interest rates normally around the 7.5% range versus those in the 15% range. There is a substantial difference in the rates at which individuals default on loans. This is significant because we see a distinct pattern: those who tend to default on loans often have interest rates between 10-20%, whereas those with rates between 5-10% tend to pay them back. Armed with this knowledge, the bank can make more educated decisions moving forward regarding their interest rates.

The importance of the reccomendations

The best classification model we chose was the Random Forest model, and we’ll delve into this shortly. When we tested this model, it registered 32 false negatives and 28 false positives. This model was much more encompassing of all the variables compared to others, which did not account for all the different variables as effectively and still produced false results. Out of all the data, this model held up the best in predicting whether someone would default. The expected error on future data is 5.84%, meaning if you were to give out loans to 100 people, this model correctly predicts whether someone will or will not default on a loan under similar conditions nearly 95% of the time.

recommendations to the company on how to reduce loan default rates

The following are our recommendations, supported by data from the logistic regression model. The most critical factors that increase the likelihood of an individual defaulting on a loan are as follows: if it’s a 5-year loan, if the installments are high, if the loan amount is high, and if the interest rates are high. Essentially, the higher these numbers go, the higher the likelihood of a default. Therefore, we recommend reducing these values. For example, if you see that 3 out of these 5 qualities hold true for a candidate, do not approve their loan.The results from the VIP visualization, which show what variables impact the possibility of a loan default, rank these four factors as nearly equally important. Moving forward, it’s crucial to monitor these closely.Potential Impact on Demographics There is a potential demographic issue to consider. Not every demographic has the same abilities or average income or wealth, so they might request longer terms or higher loan amounts with higher interest. If they meet 3 out of the 5 risk factors, you will likely deny them a loan. This could lead to perceptions that your bank is unfairly targeting certain demographics, regardless of the reasoning. While the business might thrive financially, the public image and perception could suffer. We’ve seen similar issues in the past with AI models that, once trained on biased data, exhibited discriminatory behavior.It’s vital to assess whether these decisions are inclusively considering all demographics. Overall, this strategy is beneficial because it saves money, provides clear metrics to guide decisions, and offers assurance that the loans issued are likely to be repaid. This, in turn, secures more revenue for the business, allowing it to invest more in significant corporate social responsibility (CSR) opportunities

Conclusion

As we conclude our analysis, it is important to highlight key insights and directions derived from our extensive data analysis using logistic regression models our findings if certain factors such as length of loan term, loan repayment ratio, total loan amount, and interest rate related to debt High values in each of these categories are important indicators of risk there is a substantial increase in default rates, emphasizing the importance of due diligence when approving loans Based on these findings, we recommend stricter lending standards . For example, if an applicant exhibits three or more of these risks, it may be important to reconsider loan approval to protect the financial health of the bank

Furthermore, our study sheds light on the potential demographic implications of these lending practices. There is a risk that a number of people could be disproportionately influenced by these values, leading to public perceptions of bias or unfairness in our lending programs This could affect our public image . It is important to strike a balance and fairness to ensure that our lending practices do not exclude or disproportionately penalize any group.

In summary, while our data driven approach helps us refine our lending policies and reduces the risk that our loans, it is equally important to keep track and adjust our standards to support ethical standards and public confidence and in so doing not only improve our financial stability but strengthen our reputation as a fair and responsible lender

Descriptive statistics

Why statistics

A quick look at some statistics can provide numerical insights into our current standing. However, since statistics can be skewed in various ways, we will visualize the data later to better convey the story. Finally, before training our model, we will clean and refine our data to enhance the model’s predictive accuracy.

Defualt Precent by Loan Purpose

The bank offers a variety of loans, and we aimed to categorize them by their respective purposes. Our goal was to analyze the proportion of customers who defaulted within each loan purpose group to identify any significant differences. We discovered that the highest default rate was for medical bills, with a staggering 60%. We chose to use the proportion of defaults instead of the absolute number because each group had a different number of borrowers. Using proportions allowed for a more accurate comparison

default_rates <- df %>%
  group_by(loan_purpose) %>% 
  summarise(Customers = n(),
            customers_defaulted = sum(loan_default == 'yes'),
            default_precent = 100 * mean(loan_default == 'yes'))
default_rates

Loan default by term

We applied the same approach here with different metrics. The bank offers two loan terms. We divided them into their respective groups and measured the default rates as proportions. We found that borrowers who choose a 5-year term are twice as likely to default compared to those who select the 3 year term.

df %>% 
  group_by(term) %>% 
  summarise(n_customers= n(),
  customers_default = sum(loan_default == 'yes'),
            default_percent = 100 * mean(loan_default == 'yes'))

The history of interest payments has shown an increase, with monthly car payments now comparable to what used to be reserved for mortgages. We measured whether the amount paid affects the likelihood of default. Our analysis revealed that if the monthly payment exceeds $500, there is a significant increase in defaults.

df %>% 
  group_by(loan_default, application_type) %>% 
  summarise(n_customers = n(),
            
            avg_payment = mean(installment),
            median_payment = median(installment))

Visualization

Loan Defualt Rate by Purpose

This visualization refers to the statistical manipulation we did in the first table. Numbers can tell a story but a visuals can paint a picture. Here we can see the margin by which medical loans are defaulted on and how they are towering over the other loan purpose defaults.

ggplot(data = default_rates, mapping = aes(x = loan_purpose, y = default_precent)) +
    geom_bar(stat = 'identity', fill = '#006EA1', color = 'white') +
    labs(title = 'Loan Default Rate by Purpose of Loan',
         x = 'Loan Purpose',
         y = 'Default Percentage') +
    theme_light()

Loan defaults by Interest Rates

There is a few things going on here. First we broke off our datasets and randomly selected those who defaulted and those who did not and set them to a sample size of 200 for each. We did this to decrease bias so it would fall onto proportion rather than sheer numbers. Then we combined them into one dataframe and visualized it breaking it apart by wether they defaulted or not to show the distinct seperation. We see a clear distinct differnce in the interest rates those that defaulted tend to have higher interest rates on their loans compared to those thatdo not have such high interet rates.

Defaulted <- df %>% 
  filter(loan_default == "yes") %>% 
  sample_n(200)

Loyal <- df %>%
  filter(loan_default == "no") %>% 
  sample_n(200)

sample <- bind_rows(Defaulted, Loyal)

ggplot(sample, mapping = aes(x = interest_rate, fill = loan_default)) +
  geom_histogram(color = "white", bins = 15) +
  labs(title = "Interest rates by loan deafulting", 
       x = " Interest Rates",
       y = 'Number of customers') +
  facet_grid(~loan_default)

Loan default measured by credit history

There is a common assumption in the credit world and it is the longer the credit history of someone the more likely they are to honor their loan payments, and a negative stigma is placed around those who are new to credit assuming they will behave poorly with wealth management since they are new. this visual shows us that it is far from the truth, the graph is nearly identical no mattter the credit history so this shoots down a incorrect assumption

ggplot(data = sample, mapping = aes(x = loan_default,y = years_credit_history, fill =  loan_default)) +
  geom_violin() +
  geom_jitter(width = 0.07, alpha = 0.5) +
  labs(title = "Violin Plot of  by Loan Defualt",x = "Loan Default", y = "Miles per Gallon Highway (hwy)")

**The problem with data is that data by definition are points in the past. Showing and visualizing data in the past does not determine the future nor does it help us save money, instead we can build a system that can predict based on the factors we give it, will someone defualt on a loan or not*

Decision Tree

set.seed(314)

df_split <- initial_split(df, prop = .75, strata = loan_default)

df_training <- df_split %>% 
  training()

df_testing <- df_split %>% 
  testing()

df_folds <- vfold_cv(df_training, v = 5)

df_recipe <- recipe(loan_default ~ ., data = df_training)%>%
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

df_recipe %>% 
  prep() %>% 
  bake(new_data = df_training)

tree_model <- decision_tree(cost_complexity = tune(),
                            tree_depth = tune(),
                            min_n = tune()) %>% 
  set_engine('rpart') %>%
  set_mode('classification')
tree_model

## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = tune()
##   tree_depth = tune()
##   min_n = tune()
## 
## Computational engine: rpart

tree_workflow  <- workflow() %>% 
                  add_model(tree_model) %>% 
                  add_recipe(df_recipe)

tree_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          min_n(),
                          levels = 1)

set.seed (314)
tree_tuning <- tree_workflow %>% 
  tune_grid(resamples = df_folds,
            grid = tree_grid)

best<- tree_tuning %>% 
  show_best(metric = 'roc_auc')
best

final_wf <- tree_workflow %>% 
  finalize_workflow(best)
final_wf

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_YeoJohnson()
## • step_normalize()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = 1e-10
##   tree_depth = 1
##   min_n = 2
## 
## Computational engine: rpart

tree_wf_fit <- final_wf %>% 
  fit(data = df_training)
tree_fit <- tree_wf_fit %>%
  pull_workflow_fit()

## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## ℹ Please use `extract_fit_parsnip()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

tree_fit

## parsnip model object
## 
## n= 3082 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 3082 1147 no (0.3721609 0.6278391)  
##   2) interest_rate>=0.7085622 702    0 yes (1.0000000 0.0000000) *
##   3) interest_rate< 0.7085622 2380  445 no (0.1869748 0.8130252) *

vip(tree_fit)

rpart.plot(tree_fit$fit, roundint = FALSE)

tree_last_fit <- final_wf %>%
  last_fit(df_split)

tree_last_fit %>%  
  collect_metrics()

tree_predictions <- tree_last_fit %>% collect_predictions()
tree_predictions

tree_predictions %>%
roc_curve(truth = loan_default, .pred_yes) %>%
autoplot()

conf_mat(tree_predictions, truth = loan_default,.pred_class)

##           Truth
## Prediction yes  no
##        yes 223   0
##        no  160 645

Random Forest

set.seed(345)
df_fold <- vfold_cv(df_training, v = 5)

rf_model <- rand_forest(mtry = tune(), trees = tune(),
                        min_n = tune()) %>%
  set_engine('ranger', importance = "impurity") %>% 
  set_mode('classification')
rf_model

## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: ranger

rf_workflow <- workflow() %>% add_model(rf_model) %>%
add_recipe(df_recipe)
rf_workflow

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_YeoJohnson()
## • step_normalize()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: ranger

set.seed(345)
rf_grid <- grid_random(mtry() %>%
                         range_set(c(5, 14)), trees(),
min_n(),
size = 10)
rf_grid

set.seed(345)
rf_tuning <- rf_workflow %>% 
  tune_grid(resamples = df_fold,grid  = rf_grid)
rf_tuning

rf_tuning %>%
  show_best(metric = 'roc_auc')

best_rf <- rf_tuning %>% 
  select_best(metric = 'roc_auc')
best_rf

final_rf_workflow <- rf_workflow %>% 
  finalize_workflow(best_rf)
final_rf_workflow

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_YeoJohnson()
## • step_normalize()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 12
##   trees = 1605
##   min_n = 8
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: ranger

rf_wf_fit<- final_rf_workflow %>% 
  fit(df_training)
rf_fit <- rf_wf_fit %>%
  pull_workflow_fit()
rf_fit

## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~12L,      x), num.trees = ~1605L, min.node.size = min_rows(~8L, x),      importance = ~"impurity", num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1605 
## Sample size:                      3082 
## Number of independent variables:  19 
## Mtry:                             12 
## Target node size:                 8 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.05937512

vip(rf_fit)

rf_last_fit <- final_rf_workflow %>%
  last_fit(df_split)
rf_last_fit %>% 
  collect_metrics()

rf_predictions <- rf_last_fit %>% 
  collect_predictions()
rf_predictions

rf_predictions %>% 
  roc_curve (truth = loan_default, .pred_yes) %>% 
  autoplot()

conf_mat(rf_predictions, truth = loan_default, .pred_class)

##           Truth
## Prediction yes  no
##        yes 322  25
##        no   61 620

Logistic Regression

logistic_model <- logistic_reg() %>% 
  set_engine('glm') %>%
  set_mode('classification')

df_wf <- workflow() %>% 
  add_model(logistic_model) %>% 
  add_recipe(df_recipe)

df_logistic_fit <- df_wf %>% 
  fit(df_training)
df_logistic_fit

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_YeoJohnson()
## • step_normalize()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##                   (Intercept)                    loan_amount  
##                     8.897e+00                      2.700e+01  
##                   installment                  interest_rate  
##                    -2.440e+01                     -3.051e+00  
##                 annual_income              current_job_years  
##                    -9.866e-04                      1.066e-02  
##                debt_to_income             total_credit_lines  
##                    -2.666e-01                      1.160e-01  
##          years_credit_history       loan_purpose_credit_card  
##                     5.855e-03                     -1.383e+00  
##          loan_purpose_medical    loan_purpose_small_business  
##                    -1.856e+00                      1.309e-01  
## loan_purpose_home_improvement         application_type_joint  
##                     2.667e-02                     -6.777e-01  
##                term_five_year             homeownership_rent  
##                    -1.559e+01                     -7.066e-01  
##             homeownership_own         missed_payment_2_yr_no  
##                    -5.235e-01                      3.176e-01  
##         history_bankruptcy_no           history_tax_liens_no  
##                    -1.739e-01                     -3.531e-01  
## 
## Degrees of Freedom: 3081 Total (i.e. Null);  3062 Residual
## Null Deviance:       4069 
## Residual Deviance: 761.7     AIC: 801.7

df_trained_model <- df_logistic_fit %>% 
  pull_workflow_fit()
df_trained_model

## parsnip model object
## 
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##                   (Intercept)                    loan_amount  
##                     8.897e+00                      2.700e+01  
##                   installment                  interest_rate  
##                    -2.440e+01                     -3.051e+00  
##                 annual_income              current_job_years  
##                    -9.866e-04                      1.066e-02  
##                debt_to_income             total_credit_lines  
##                    -2.666e-01                      1.160e-01  
##          years_credit_history       loan_purpose_credit_card  
##                     5.855e-03                     -1.383e+00  
##          loan_purpose_medical    loan_purpose_small_business  
##                    -1.856e+00                      1.309e-01  
## loan_purpose_home_improvement         application_type_joint  
##                     2.667e-02                     -6.777e-01  
##                term_five_year             homeownership_rent  
##                    -1.559e+01                     -7.066e-01  
##             homeownership_own         missed_payment_2_yr_no  
##                    -5.235e-01                      3.176e-01  
##         history_bankruptcy_no           history_tax_liens_no  
##                    -1.739e-01                     -3.531e-01  
## 
## Degrees of Freedom: 3081 Total (i.e. Null);  3062 Residual
## Null Deviance:       4069 
## Residual Deviance: 761.7     AIC: 801.7

vip(df_trained_model)

prediction_cat <- predict(df_logistic_fit, new_data = df_testing)
prediction_cat

prediction_prob<- predict(df_logistic_fit, df_testing, type = 'prob')
prediction_prob

test_results <- df_testing %>% 
  select(loan_default) %>% 
  bind_cols(prediction_cat) %>%
  bind_cols(prediction_prob)

conf_mat(test_results, truth = loan_default,estimate = .pred_class )

##           Truth
## Prediction yes  no
##        yes 351  28
##        no   32 617

— End —

Daniyal Zafar