Data Visualization for Bank Loan

Introduction

You can find the data set here.

Business Question

The project aims to visualize and analyze a loan dataset to gain insights into borrower profiles and loan status. By leveraging data visualization techniques, we seek to uncover patterns, trends, and relationships within the dataset, providing valuable information for decision-making and understanding the factors influencing loan outcomes.

The dataset consists of various variables related to borrowers, including loan status, loan amount, credit score, annual income, employment history, home ownership, and more. These variables capture essential aspects of borrowers’ financial profiles and loan characteristics.

The project will employ RStudio and popular packages such as dplyr, ggplot2, and tibble for data wrangling, manipulation, and visualization. By applying exploratory data analysis techniques, we will identify meaningful insights and relationships within the data.

Here are the project goals summarized in bullet points:

Visualize and analyze a loan dataset to gain insights into borrower profiles and loan status.
Identify patterns, trends, and relationships within the dataset using data visualization techniques.
Determine the overall distribution of loan statuses in the dataset.
Analyze the most common loan term (short term or long term) in the dataset.
Explore the distribution of loan purposes and identify the most frequently occurring purposes.
Investigate how credit scores vary across different home ownership types.
Examine the distribution of job stability levels based on different loan purposes.
Analyze the distribution of loan status based on loan term and credit score categories.
Investigate the relationship between debt-to-income ratios and loan purposes, as well as job stability categories.
Determine how home ownership type affects the debt-to-income ratio across different credit score categories.
Explore the distribution of bankruptcies across different loan purposes and job stability categories.

By achieving these goals, the project aims to provide valuable insights for decision-making in the lending industry, financial planning, and policy development.

Dataset Overview

Loan.Status: Indicates the status of the loan, whether it is “Fully Paid” or “Charged Off”.
Current.Loan.Amount: Represents the current loan amount in dollars.
Term: Specifies the term of the loan, either “Short Term” or “Long Term”.
Credit.Score: Represents the credit score of the borrower.
Annual.Income: Indicates the annual income of the borrower in dollars.
Years.in.current.job: Represents the number of years the borrower has been in their current job.
Home.Ownership: Specifies the type of home ownership, such as “Home Mortgage”, “Own Home”, or “Rent”.
Purpose: Indicates the purpose of the loan, such as “Home Improvements” or “Debt Consolidation”.
Monthly.Debt: Represents the monthly debt payments of the borrower.
Years.of.Credit.History: Represents the number of years of credit history.
Number.of.Open.Accounts: Indicates the number of open credit accounts the borrower has.
Number.of.Credit.Problems: Represents the number of credit problems the borrower has faced.
Current.Credit.Balance: Represents the current balance on the borrower’s credit accounts.
Maximum.Open.Credit: Indicates the maximum open credit available to the borrower.
Bankruptcies: Specifies the number of bankruptcies the borrower has filed.
Tax.Liens: Represents the number of tax liens the borrower has.
Debt.To.Income.Ratio: Represents the ratio of debt to income for the borrower.
Credit.Utilization.Ratio: Indicates the ratio of credit utilized by the borrower.
Years.numeric: Represents the number of years the borrower has been in their current job (numeric format).
Job.Stability: Indicates the level of job stability, such as “Stable”, “Moderate”, or “Unstable”.
Credit.Score.Category: Specifies the credit score category of the borrower, such as “Good” or “Very Good”.
Loan.Amount.Category: Represents the loan amount category, such as “Low”, “Medium”, or “High”.

Data Preparation

Import Package & Data Set

Import package

library(dplyr) # The dplyr package is used for data wrangling and manipulation. It provides functions like filter(), mutate(), summarize(), and arrange() for efficient data manipulation.

library(tibble) # The tibble package provides an improved version of data frames in R. Tibbles are easier to work with and offer better printing and subsetting behavior.

library(ggplot2) # The ggplot2 package is a powerful tool for creating data visualizations. It is based on the grammar of graphics and offers a wide range of plotting functions and options.

library(igraph) # The igraph package is used for analyzing and visualizing graph data. It provides functions for working with networks and graphs, including measuring properties, detecting communities, and visualizing graphs using different layout algorithms.

Import Dataset

bankLoan_df <- read.csv(file = "data_input/Bank_loan_train.csv")
rmarkdown::paged_table(bankLoan_df)

Checking Structure using `glimpse()`

The glimpse() function is used to display the structure of the object. The object can be a vector, data frame, list, or other R object.

glimpse(bankLoan_df)

#> Rows: 100,514
#> Columns: 19
#> $ Loan.ID                      <chr> "14dd8831-6af5-400b-83ec-68e61888a048", "…
#> $ Customer.ID                  <chr> "981165ec-3274-42f5-a3b4-d104041a9ca9", "…
#> $ Loan.Status                  <chr> "Fully Paid", "Fully Paid", "Fully Paid",…
#> $ Current.Loan.Amount          <int> 445412, 262328, 99999999, 347666, 176220,…
#> $ Term                         <chr> "Short Term", "Short Term", "Short Term",…
#> $ Credit.Score                 <int> 709, NA, 741, 721, NA, 7290, 730, NA, 678…
#> $ Annual.Income                <int> 1167493, NA, 2231892, 806949, NA, 896857,…
#> $ Years.in.current.job         <chr> "8 years", "10+ years", "8 years", "3 yea…
#> $ Home.Ownership               <chr> "Home Mortgage", "Home Mortgage", "Own Ho…
#> $ Purpose                      <chr> "Home Improvements", "Debt Consolidation"…
#> $ Monthly.Debt                 <dbl> 5214.74, 33295.98, 29200.53, 8741.90, 206…
#> $ Years.of.Credit.History      <dbl> 17.2, 21.1, 14.9, 12.0, 6.1, 17.3, 19.6, …
#> $ Months.since.last.delinquent <int> NA, 8, 29, NA, NA, NA, 10, 8, 33, NA, 76,…
#> $ Number.of.Open.Accounts      <int> 6, 35, 18, 9, 15, 6, 13, 15, 4, 20, 16, 2…
#> $ Number.of.Credit.Problems    <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
#> $ Current.Credit.Balance       <int> 228190, 229976, 297996, 256329, 253460, 2…
#> $ Maximum.Open.Credit          <int> 416746, 850784, 750090, 386958, 427174, 2…
#> $ Bankruptcies                 <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
#> $ Tax.Liens                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Data Inspection

Explicit Coercion

The first step before any data analysis is to make sure the data is clean. One data cleansing technique is converting each column to its correct data type, which is known as explicit coercion.

# check the loan data structure again

glimpse(bankLoan_df)

#> Rows: 100,514
#> Columns: 19
#> $ Loan.ID                      <chr> "14dd8831-6af5-400b-83ec-68e61888a048", "…
#> $ Customer.ID                  <chr> "981165ec-3274-42f5-a3b4-d104041a9ca9", "…
#> $ Loan.Status                  <chr> "Fully Paid", "Fully Paid", "Fully Paid",…
#> $ Current.Loan.Amount          <int> 445412, 262328, 99999999, 347666, 176220,…
#> $ Term                         <chr> "Short Term", "Short Term", "Short Term",…
#> $ Credit.Score                 <int> 709, NA, 741, 721, NA, 7290, 730, NA, 678…
#> $ Annual.Income                <int> 1167493, NA, 2231892, 806949, NA, 896857,…
#> $ Years.in.current.job         <chr> "8 years", "10+ years", "8 years", "3 yea…
#> $ Home.Ownership               <chr> "Home Mortgage", "Home Mortgage", "Own Ho…
#> $ Purpose                      <chr> "Home Improvements", "Debt Consolidation"…
#> $ Monthly.Debt                 <dbl> 5214.74, 33295.98, 29200.53, 8741.90, 206…
#> $ Years.of.Credit.History      <dbl> 17.2, 21.1, 14.9, 12.0, 6.1, 17.3, 19.6, …
#> $ Months.since.last.delinquent <int> NA, 8, 29, NA, NA, NA, 10, 8, 33, NA, 76,…
#> $ Number.of.Open.Accounts      <int> 6, 35, 18, 9, 15, 6, 13, 15, 4, 20, 16, 2…
#> $ Number.of.Credit.Problems    <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
#> $ Current.Credit.Balance       <int> 228190, 229976, 297996, 256329, 253460, 2…
#> $ Maximum.Open.Credit          <int> 416746, 850784, 750090, 386958, 427174, 2…
#> $ Bankruptcies                 <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
#> $ Tax.Liens                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

To change the data type, we can use the as.___() function where ___ is filled with the destination data type. Example:

as.character()
as.Date()
as.integer()
as.numeric()
as.factor()

From the data, the columns whose data type I changed are: Loan.Status, Term, Home.Ownership, Purpose, and Bankruptcies -> Factor; Credit.Score -> Double

#explicit coercion
bankLoan_df <- bankLoan_df %>%
  mutate(Loan.Status = as.factor(Loan.Status),
         Term = as.factor(Term),
         Home.Ownership = as.factor(Home.Ownership),
         Purpose = as.factor(Purpose),
         Bankruptcies = as.factor(Bankruptcies),
         Credit.Score = as.double(Credit.Score)
         )

glimpse(bankLoan_df)

#> Rows: 100,514
#> Columns: 19
#> $ Loan.ID                      <chr> "14dd8831-6af5-400b-83ec-68e61888a048", "…
#> $ Customer.ID                  <chr> "981165ec-3274-42f5-a3b4-d104041a9ca9", "…
#> $ Loan.Status                  <fct> Fully Paid, Fully Paid, Fully Paid, Fully…
#> $ Current.Loan.Amount          <int> 445412, 262328, 99999999, 347666, 176220,…
#> $ Term                         <fct> Short Term, Short Term, Short Term, Long …
#> $ Credit.Score                 <dbl> 709, NA, 741, 721, NA, 7290, 730, NA, 678…
#> $ Annual.Income                <int> 1167493, NA, 2231892, 806949, NA, 896857,…
#> $ Years.in.current.job         <chr> "8 years", "10+ years", "8 years", "3 yea…
#> $ Home.Ownership               <fct> Home Mortgage, Home Mortgage, Own Home, O…
#> $ Purpose                      <fct> Home Improvements, Debt Consolidation, De…
#> $ Monthly.Debt                 <dbl> 5214.74, 33295.98, 29200.53, 8741.90, 206…
#> $ Years.of.Credit.History      <dbl> 17.2, 21.1, 14.9, 12.0, 6.1, 17.3, 19.6, …
#> $ Months.since.last.delinquent <int> NA, 8, 29, NA, NA, NA, 10, 8, 33, NA, 76,…
#> $ Number.of.Open.Accounts      <int> 6, 35, 18, 9, 15, 6, 13, 15, 4, 20, 16, 2…
#> $ Number.of.Credit.Problems    <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
#> $ Current.Credit.Balance       <int> 228190, 229976, 297996, 256329, 253460, 2…
#> $ Maximum.Open.Credit          <int> 416746, 850784, 750090, 386958, 427174, 2…
#> $ Bankruptcies                 <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
#> $ Tax.Liens                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
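When several columns need the same conversion, the repeated as.factor() calls can be condensed. The following is a minimal sketch in base R on a small made-up data frame; toy_df and factor_cols are illustrative names, not part of the project.

```r
# Toy stand-in for bankLoan_df; the values are made up for illustration
toy_df <- data.frame(
  Loan.Status  = c("Fully Paid", "Charged Off", "Fully Paid"),
  Term         = c("Short Term", "Long Term", "Short Term"),
  Credit.Score = c(709L, 741L, 721L),
  stringsAsFactors = FALSE
)

# Columns that should become factors (hypothetical helper vector)
factor_cols <- c("Loan.Status", "Term")

# Apply as.factor() to every listed column in one step
toy_df[factor_cols] <- lapply(toy_df[factor_cols], as.factor)

# Convert the integer score to double, as in the project code
toy_df$Credit.Score <- as.double(toy_df$Credit.Score)

str(toy_df)
```

With dplyr 1.0 or later, mutate(across(all_of(factor_cols), as.factor)) expresses the same conversion inside a pipe.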

Drop “Loan.ID” and “Customer.ID” as they are features for identification

bankLoan_df <- bankLoan_df[, !(names(bankLoan_df) %in% c("Loan.ID", "Customer.ID"))]

Check Missing Value

# Calculate the total and percentage of missing values in each column
missing_data <- data.frame(
  Column = names(bankLoan_df),
  Total = colSums(is.na(bankLoan_df)),
  Percent = colMeans(is.na(bankLoan_df)) * 100
)

# Create a tibble with the missing data information
missing_table <- as_tibble(missing_data)

# Print the missing data table
print(missing_table)

#> # A tibble: 17 × 3
#>    Column                       Total Percent
#>    <chr>                        <dbl>   <dbl>
#>  1 Loan.Status                      0   0    
#>  2 Current.Loan.Amount            514   0.511
#>  3 Term                             0   0    
#>  4 Credit.Score                 19668  19.6  
#>  5 Annual.Income                19668  19.6  
#>  6 Years.in.current.job             0   0    
#>  7 Home.Ownership                   0   0    
#>  8 Purpose                          0   0    
#>  9 Monthly.Debt                   514   0.511
#> 10 Years.of.Credit.History        514   0.511
#> 11 Months.since.last.delinquent 53655  53.4  
#> 12 Number.of.Open.Accounts        514   0.511
#> 13 Number.of.Credit.Problems      514   0.511
#> 14 Current.Credit.Balance         514   0.511
#> 15 Maximum.Open.Credit            516   0.513
#> 16 Bankruptcies                   718   0.714
#> 17 Tax.Liens                      524   0.521

We have: - 53.4% missing data in Months.since.last.delinquent. - 19.6% in both Credit.Score and Annual.Income

Drop the columns from the “bankLoan_df” dataset that have more than 50% missing values.

# Calculate the percentage of missing values in each column
missing_percent <- colMeans(is.na(bankLoan_df)) * 100

# Identify the column names with more than 50% missing values
columns_to_drop <- names(missing_percent[missing_percent > 50])

# Drop the columns with more than 50% missing values
bankLoan_df <- bankLoan_df[, !(names(bankLoan_df) %in% columns_to_drop)]

Drop every row that contains an NA or missing value in any column

# Drop rows with any NA or missing values
bankLoan_df <- bankLoan_df[complete.cases(bankLoan_df), ]

sum(is.na(bankLoan_df))

#> [1] 0
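Because complete.cases() removes a row as soon as one column is missing, it can be useful to see how many rows survive. The sketch below illustrates this on a small made-up data frame (toy_df, rows_before, and rows_after are illustrative names); it is not the result for the real dataset.

```r
# Toy data with missing values; numbers are for illustration only
toy_df <- data.frame(
  Credit.Score  = c(709, NA, 741, 721, NA),
  Annual.Income = c(1167493, NA, 2231892, 806949, NA),
  Monthly.Debt  = c(5214.74, 33295.98, 29200.53, 8741.90, NA)
)

# Share of missing values per column, in percent
missing_percent <- colMeans(is.na(toy_df)) * 100
print(missing_percent)

# complete.cases() keeps only rows with no NA in any column
rows_before <- nrow(toy_df)
toy_clean   <- toy_df[complete.cases(toy_df), ]
rows_after  <- nrow(toy_clean)

cat("Rows kept:", rows_after, "of", rows_before, "\n")
```

Comparing the two row counts on the real data shows how much of the dataset the later plots are actually based on.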

Data Wrangling

Debt-to-Income Ratio

The debt-to-income (DTI) ratio is the percentage of your gross monthly income that goes to paying your monthly debt payments and is used by lenders to determine your borrowing risk.

I calculated it by dividing the monthly debt by the monthly income, where monthly income is the annual income divided by 12.

# Debt-to-Income Ratio
bankLoan_df$Debt.To.Income.Ratio <- bankLoan_df$Monthly.Debt / (bankLoan_df$Annual.Income / 12)
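As a quick check of the formula, the calculation can be reproduced by hand for a single borrower. This sketch uses the Monthly.Debt and Annual.Income values of the first row shown in the glimpse() output; the variable names are illustrative only.

```r
monthly_debt  <- 5214.74   # Monthly.Debt of the first row
annual_income <- 1167493   # Annual.Income of the first row

# Monthly income is the annual income spread over 12 months
monthly_income <- annual_income / 12

# Debt-to-income ratio: monthly debt as a share of monthly income
dti <- monthly_debt / monthly_income
print(dti)
```

The result is a fraction, not a percentage, so a value of about 0.05 means that roughly 5% of monthly income goes to debt payments.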

Credit Utilization Ratio

The credit utilization ratio, generally expressed as a percentage, represents the amount of revolving credit you’re using divided by the total credit available to you. Lenders use your credit utilization ratio to help determine how well you’re managing your current debt.

I calculated by dividing the current credit balance by the maximum open credit.

# Credit Utilization Ratio
bankLoan_df$Credit.Utilization.Ratio <- bankLoan_df$Current.Credit.Balance / bankLoan_df$Maximum.Open.Credit
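A plain division returns Inf or NaN whenever the denominator is 0. Whether Maximum.Open.Credit actually contains zeros is not shown here, so the following is only a sketch of one possible guard, on made-up vectors with illustrative names.

```r
current_balance <- c(228190, 297996, 5000)
max_open_credit <- c(416746, 750090, 0)   # the last value is a made-up zero limit

# Plain division: a zero limit produces Inf (or NaN when the balance is also 0)
raw_ratio <- current_balance / max_open_credit
print(raw_ratio)

# Guarded version: treat a zero limit as "ratio not defined"
safe_ratio <- ifelse(max_open_credit > 0,
                     current_balance / max_open_credit,
                     NA_real_)
print(safe_ratio)
```

Note that complete.cases() treats NaN as missing but not Inf, so an unguarded Inf would stay in the data and could stretch the axis of any plot of this ratio.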

Job Stability

I categorize the years in the current job into three groups, “Unstable” (up to 2 years), “Moderate” (more than 2 and up to 5 years), and “Stable” (more than 5 years), based on the duration.

# Extract the number of years from the "Years.in.current.job" text ("10+ years" becomes 10)
bankLoan_df$Years.numeric <- as.numeric(gsub("[^0-9]+", "", bankLoan_df$Years.in.current.job))
bankLoan_df$Years.numeric[grepl("\\+ years", bankLoan_df$Years.in.current.job)] <- 10

# Categorize job stability based on the new column
bankLoan_df$Job.Stability <- cut(bankLoan_df$Years.numeric,
                                 breaks = c(-Inf, 2, 5, Inf),
                                 labels = c("Unstable", "Moderate", "Stable"))
bankLoan_df$Job.Stability <- as.factor(bankLoan_df$Job.Stability)
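To see how the text labels are turned into numbers and then into categories, the same two steps can be run on a handful of example labels. This is a sketch; the "n/a" entry is an assumed example of a label without digits, and the variable names are illustrative.

```r
years_raw <- c("8 years", "10+ years", "< 1 year", "3 years", "n/a")

# Keep digits only; a label without digits becomes "" and then NA
years_num <- as.numeric(gsub("[^0-9]+", "", years_raw))

# Intervals are (-Inf, 2], (2, 5], (5, Inf), matching the breaks used above
stability <- cut(years_num,
                 breaks = c(-Inf, 2, 5, Inf),
                 labels = c("Unstable", "Moderate", "Stable"))

print(data.frame(years_raw, years_num, stability))
```

Two details follow from this: "< 1 year" is stored as 1, which still lands in “Unstable”, and any label without digits becomes NA and is removed by the later complete.cases() step.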

Credit Score Category

Although ranges vary depending on the credit scoring model, generally credit scores from 580 to 669 are considered fair; 670 to 739 are considered good; 740 to 799 are considered very good; and 800 and up are considered excellent.

I create a new column that categorizes the credit scores into the groups “Poor,” “Fair,” “Good,” “Very Good,” and “Excellent” based on the “Credit.Score” column.

# Create a new column to categorize the credit scores
credit_score_breaks <- c(-Inf, 579, 669, 739, 799, Inf)
credit_score_labels <- c("Poor", "Fair", "Good", "Very Good", "Excellent")
bankLoan_df$Credit.Score.Category <- cut(bankLoan_df$Credit.Score, 
                                         breaks = credit_score_breaks, 
                                         labels = credit_score_labels,
                                         include.lowest = TRUE)
bankLoan_df$Credit.Score.Category <- as.factor(bankLoan_df$Credit.Score.Category)
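The boundaries are easiest to verify on values that sit right at the breaks. The sketch below also flags scores above 850; that upper bound is an assumption made here for illustration, and the value 7290 is taken from the glimpse() output.

```r
scores <- c(579, 580, 669, 670, 739, 740, 799, 800, 7290)

# Same breaks and labels as in the project code
category <- cut(scores,
                breaks = c(-Inf, 579, 669, 739, 799, Inf),
                labels = c("Poor", "Fair", "Good", "Very Good", "Excellent"))

# Flag values above 850, the upper bound assumed here for a credit score
out_of_range <- scores > 850

print(data.frame(scores, category, out_of_range))
```

Because the last interval is open-ended, a value such as 7290 is labelled “Excellent” even though it is far outside the usual score range, which is worth keeping in mind when reading plots by credit score category.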

Categorize the Loan Amount

I create a new column that categorizes the loan amount into three ranges to indicate the size of the loan: “Low” (up to $100,000), “Medium” (more than $100,000 and up to $1,000,000), and “High” (more than $1,000,000).

# Create a new column to categorize the loan amount
loan_amount_breaks <- c(0, 100000, 1000000, Inf)
loan_amount_labels <- c("Low", "Medium", "High")
bankLoan_df$Loan.Amount.Category <- cut(bankLoan_df$Current.Loan.Amount,
                                         breaks = loan_amount_breaks,
                                         labels = loan_amount_labels,
                                         include.lowest = TRUE)
bankLoan_df$Loan.Amount.Category <- as.factor(bankLoan_df$Loan.Amount.Category)

Drop (again) every row that contains an NA or missing value in any column

# Drop rows with any NA or missing values
bankLoan_df <- bankLoan_df[complete.cases(bankLoan_df), ]
head(bankLoan_df)

#>   Loan.Status Current.Loan.Amount       Term Credit.Score Annual.Income
#> 1  Fully Paid              445412 Short Term          709       1167493
#> 3  Fully Paid            99999999 Short Term          741       2231892
#> 4  Fully Paid              347666  Long Term          721        806949
#> 6 Charged Off              206602 Short Term         7290        896857
#> 7  Fully Paid              217646 Short Term          730       1184194
#> 9  Fully Paid              548746 Short Term          678       2559110
#>   Years.in.current.job Home.Ownership            Purpose Monthly.Debt
#> 1              8 years  Home Mortgage  Home Improvements      5214.74
#> 3              8 years       Own Home Debt Consolidation     29200.53
#> 4              3 years       Own Home Debt Consolidation      8741.90
#> 6            10+ years  Home Mortgage Debt Consolidation     16367.74
#> 7             < 1 year  Home Mortgage Debt Consolidation     10855.08
#> 9              2 years           Rent Debt Consolidation     18660.28
#>   Years.of.Credit.History Number.of.Open.Accounts Number.of.Credit.Problems
#> 1                    17.2                       6                         1
#> 3                    14.9                      18                         1
#> 4                    12.0                       9                         0
#> 6                    17.3                       6                         0
#> 7                    19.6                      13                         1
#> 9                    22.6                       4                         0
#>   Current.Credit.Balance Maximum.Open.Credit Bankruptcies Tax.Liens
#> 1                 228190              416746            1         0
#> 3                 297996              750090            0         0
#> 4                 256329              386958            0         0
#> 6                 215308              272448            0         0
#> 7                 122170              272052            1         0
#> 9                 437171              555038            0         0
#>   Debt.To.Income.Ratio Credit.Utilization.Ratio Years.numeric Job.Stability
#> 1           0.05359936                0.5475517             8        Stable
#> 3           0.15699969                0.3972803             8        Stable
#> 4           0.12999929                0.6624207             3      Moderate
#> 6           0.21900133                0.7902719            10        Stable
#> 7           0.10999968                0.4490686             1      Unstable
#> 9           0.08750048                0.7876416             2      Unstable
#>   Credit.Score.Category Loan.Amount.Category
#> 1                  Good               Medium
#> 3             Very Good                 High
#> 4                  Good               Medium
#> 6             Excellent               Medium
#> 7                  Good               Medium
#> 9                  Good               Medium

Basic Business Question & Data Plotting

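1. Correlation between Columns

# Create a graph object: the first two columns (Loan.Status, Term) define the edges,
# and the remaining columns (Home.Ownership, Purpose) are stored as edge attributes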
graph <- graph_from_data_frame(bankLoan_df[, c("Loan.Status", "Term", "Home.Ownership", "Purpose")])

# Plot the graph
plot(graph, vertex.label.dist = 2, vertex.size = 10, vertex.label.cex = 0.8, edge.arrow.size = 0.5)

Insight:
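For categorical columns such as Loan.Status and Term, a cross-tabulation is a simple numeric complement to the graph. This is a sketch on made-up rows, not a result from the real dataset; toy_df, status_by_term, and status_share are illustrative names.

```r
# Made-up rows with the same two categorical columns
toy_df <- data.frame(
  Loan.Status = c("Fully Paid", "Fully Paid", "Charged Off",
                  "Fully Paid", "Charged Off", "Fully Paid"),
  Term        = c("Short Term", "Short Term", "Long Term",
                  "Long Term", "Long Term", "Short Term")
)

# Counts for every combination of the two columns
status_by_term <- table(toy_df$Loan.Status, toy_df$Term)
print(status_by_term)

# Share of each loan status within a term (each column sums to 1)
status_share <- prop.table(status_by_term, margin = 2)
print(round(status_share, 2))
```

If the shares differ clearly between columns, the two variables are associated; if the columns look alike, they are close to independent.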

2. Overall distribution of loan statuses in the dataset

Understanding the distribution of loan statuses can help users and related parties, such as banks, gain insights into the overall performance of loans and assess the risk associated with different loan statuses.

# Create the plot
loanStatusPlot <- ggplot(bankLoan_df, aes(x = Loan.Status, fill = Loan.Status)) +
  geom_bar() +
  theme_minimal() +
  labs(
    title = "Distribution of Loan Statuses",
    x = "Loan Status",
    y = "Count"
  )

# Display the plot
loanStatusPlot

Insight:

Most loans are fully paid; fully paid loans are roughly three times as numerous as charged-off loans.

3. What is the most common loan term (short term or long term) in the dataset?

This question helps us identify the dominant loan term in the dataset. Knowing the most common loan term can assist users and related parties, such as banks, in understanding the preferred loan duration and tailoring their lending strategies accordingly.

# Create a summary table to calculate the count of each loan term
loanTermSummary <- data.frame(table(bankLoan_df$Term))

# Create the plot
loanTermPlot <- ggplot(loanTermSummary, aes(x = Var1, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(
    title = "Distribution of Loan Terms",
    x = "Loan Term",
    y = "Count",
    fill = "Loan Term"
  )

# Display the plot
loanTermPlot

Insight:

Most borrowers take short-term loans; they are almost three times as numerous as long-term loans.

4. What is the distribution of loan purposes in the dataset, and which purpose appears most frequently?

By analyzing the frequency of different loan purposes, users and related parties, such as banks, can gain insights into the most common reasons why borrowers seek loans. This information can help lenders tailor their loan products and marketing strategies to better meet the needs of borrowers based on their preferred purposes. Additionally, understanding the distribution of loan purposes can provide insights into market trends and consumer behavior related to borrowing.

# Calculate the loan purpose frequency
loan_purpose_freq <- bankLoan_df %>%
  count(Purpose, sort = TRUE)

# Create the visualization using ggplot
ggplot(loan_purpose_freq, aes(x = Purpose, y = n, fill = Purpose)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(
    title = "Distribution of Loan Purposes",
    x = "Loan Purpose",
    y = "Frequency"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )
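Although count() sorts the rows by frequency, ggplot2 orders the bars by the factor levels of Purpose, which are alphabetical by default. One option, sketched here on made-up counts, is to reorder the levels by frequency with the base R function reorder().

```r
# Made-up frequency table shaped like loan_purpose_freq
loan_purpose_freq <- data.frame(
  Purpose = c("Business Loan", "Debt Consolidation", "Home Improvements"),
  n       = c(15, 120, 40)
)

# Reorder the levels so that the largest count comes first
loan_purpose_freq$Purpose <- reorder(loan_purpose_freq$Purpose,
                                     -loan_purpose_freq$n)

print(levels(loan_purpose_freq$Purpose))
```

The same call can be placed directly in the aesthetic mapping, for example aes(x = reorder(Purpose, -n), y = n), so the most frequent purpose appears at the left of the plot.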

Insight:

Most borrowers use their loan for debt consolidation; even when all other loan purposes are added together, they are still far below debt consolidation.
Very few borrowers use their loan for personal or leisure purchases such as a vacation, wedding, trip, car, house, or moving.

5. How does the distribution of credit scores vary across different home ownership types?

By analyzing this distribution, users and related parties, such as banks, can assess the creditworthiness and potential risks associated with different home ownership types.

# Create a bar plot to display the distribution of credit score categories by home ownership
ggplot(bankLoan_df, aes(x = Home.Ownership, fill = Credit.Score.Category)) +
  geom_bar(position = "fill") +
  theme_minimal() +
  labs(
    title = "Distribution of Credit Score Categories by Home Ownership",
    x = "Home Ownership",
    y = "Proportion",
    fill = "Credit Score Category"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    legend.position = "right"
  )

Insight:

The home ownership types do not differ much from one another: the proportions of the credit score categories barely change across them, which suggests that the association between home ownership and credit score category is weak.
In most home ownership types, the majority of borrowers fall in the good credit score category.

6. What is the distribution of job stability levels based on different loan purposes?

Understanding this distribution can help users and related parties, such as banks, assess the employment stability of borrowers based on their loan purposes and evaluate the associated risks.

# Create a bar plot to display the distribution of job stability levels by loan purposes
ggplot(bankLoan_df, aes(y = Purpose, fill = Job.Stability)) +
  geom_bar(position = "fill") +
  theme_minimal() +
  labs(
    title = "Distribution of Job Stability Levels by Loan Purposes",
    x = "Proportion",
    y = "Loan Purpose",
    fill = "Job Stability"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    legend.position = "right"
  )

Insight:

Most borrowers who take a loan for educational expenses have an unstable job; this may depend on several other variables and needs more in-depth research.
Most borrowers who take a loan for renewable_energy have a stable job.
Apart from these, there is no marked difference in job stability between loan purposes.

7. What is the distribution of loan status (fully paid, charged off) based on the loan term (short term, long term), and how does it vary across different credit score categories?

By understanding the distribution of loan status across different credit score categories and loan terms, banks and related parties can gain insights into the creditworthiness of borrowers and assess the risk associated with different types of loans.

# Create a stacked bar plot to display the distribution of loan status by loan term and credit score categories
ggplot(bankLoan_df, aes(x = Term, fill = Loan.Status)) +
  geom_bar(position = "fill") +
  facet_wrap(~Credit.Score.Category) +
  theme_minimal() +
  labs(
    title = "Distribution of Loan Status by Loan Term and Credit Score",
    x = "Loan Term",
    y = "Proportion"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    legend.position = "bottom"
  )

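Insight:

All borrowers in the excellent credit score category have a charged-off loan status, while the other three categories (very good, good, and fair) are dominated by fully paid loans.
Note that the excellent category also contains values such as 7290 (visible in the head() output above), which lie far outside the usual credit score range, so this result may reflect a data-entry issue rather than borrower behavior.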

8. How does the debt-to-income ratio vary for different purposes of the loan (e.g., debt consolidation, home improvements) and across different job stability categories?

This question explores the relationship between the debt-to-income ratio, loan purposes, and job stability. By examining how the debt-to-income ratio differs across loan purposes and job stability categories, banks and related parties can assess the financial health of borrowers and evaluate the potential risks associated with specific loan purposes and job stability levels.

# Create a stacked bar plot of the mean debt-to-income ratio by loan purpose and job stability category
ggplot(bankLoan_df, aes(y = Purpose, x = Debt.To.Income.Ratio, fill = Job.Stability)) +
  geom_bar(stat = "summary", fun = "mean", position = "stack") +
  theme_minimal() +
  labs(
    title = "Mean Debt-to-Income Ratio by Loan Purpose and Job Stability",
    x = "Debt-to-Income Ratio (Mean)",
    y = "Loan Purpose",
    fill = "Job Stability"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    legend.position = "bottom"
  )
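With position = "stack", the bar segments are placed end to end, so the total bar length is the sum of the group means rather than a single mean. To read the individual means, a small summary table can help. This is a sketch on made-up values; toy_df and mean_dti are illustrative names.

```r
# Made-up rows with the three columns used in the plot
toy_df <- data.frame(
  Purpose = c("Debt Consolidation", "Debt Consolidation", "Vacation",
              "Vacation", "Debt Consolidation", "Vacation"),
  Job.Stability = c("Stable", "Unstable", "Stable",
                    "Unstable", "Stable", "Unstable"),
  Debt.To.Income.Ratio = c(0.20, 0.30, 0.10, 0.14, 0.24, 0.18)
)

# Mean debt-to-income ratio for every purpose and job stability combination
mean_dti <- aggregate(Debt.To.Income.Ratio ~ Purpose + Job.Stability,
                      data = toy_df, FUN = mean)
print(mean_dti)
```

Using position = "dodge" instead of "stack" in geom_bar() would draw these means side by side, which makes the groups directly comparable.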

Insight:

Some borrowers take loans for vacations and debt consolidation even though they have an unstable job; the bank needs to be careful with these loans because the borrowers may not be able to repay on time.
Borrowers with a stable job tend to use their loans for important needs such as debt consolidation, business, and medical bills rather than for personal purchases.

9. Does home ownership type affect the debt-to-income ratio, and if so, how does it differ across different credit score categories?

This question examines the relationship between home ownership type, debt-to-income ratio, and credit score categories. By analyzing this relationship, banks and related parties can understand the impact of home ownership on borrowers’ debt-to-income ratio and assess how it varies across different credit score categories, providing insights into borrowers’ financial stability.

# Create a box plot to compare the debt-to-income ratio across different home ownership types and credit score categories
ggplot(bankLoan_df, aes(x = Home.Ownership, y = Debt.To.Income.Ratio, fill = Credit.Score.Category)) +
  geom_boxplot() +
  theme_minimal() +
  labs(
    title = "Debt-to-Income Ratio by Home Ownership Type and Credit Score Category",
    x = "Home Ownership",
    y = "Debt-to-Income Ratio",
    fill = "Credit Score Category"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    legend.position = "bottom"
  )
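The points drawn beyond the whiskers follow the usual 1.5 × IQR rule. Base R's boxplot.stats() applies the same rule and can be used to list such points, although its hinge calculation may differ slightly from the quartiles used by geom_boxplot(). The values below are made up for illustration.

```r
# Made-up debt-to-income ratios; 0.90 is deliberately extreme
dti <- c(0.10, 0.11, 0.12, 0.13, 0.15, 0.90)

# coef = 1.5 is the multiplier of the interquartile range for the whiskers
bp <- boxplot.stats(dti, coef = 1.5)

# Values lying beyond the whiskers
print(bp$out)
```

Running this per home ownership type and credit score category on the real data would give the outlier counts behind the plot.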

Insight:

There are no outliers for the “have mortgage” home ownership type in any credit score category.
Borrowers with a very good credit score and a home mortgage show the most extreme outliers.
Across all home ownership types, the debt-to-income ratio is moderate, and it does not differ much between credit score categories.

10. How does the distribution of bankruptcies vary across different loan purposes and job stability categories?

By analyzing this relationship, banks and related parties can assess the impact of loan purposes and job stability on borrowers’ financial distress and evaluate the potential risk associated with specific loan purposes and job stability levels.

# Create a stacked bar plot to compare the distribution of bankruptcies based on loan purposes and job stability categories
ggplot(bankLoan_df, aes(y = Purpose, fill = Bankruptcies)) +
  geom_bar(position = "fill") +
  facet_wrap(~ Job.Stability) +
  theme_minimal() +
  labs(
    title = "Distribution of Bankruptcies by Loan Purposes and Job Stability",
    x = "Proportion",
    y = "Loan Purposes",
    fill = "Bankruptcies"
  ) +
  theme(
    plot.title = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

Insight:

Borrowers with an unstable job who use their loan for small_business tend to have the highest share of bankruptcies; even though it is only a small portion, the bank should take it into consideration.
Borrowers who use their loan for educational expenses have never experienced a bankruptcy, which suggests that lending for educational expenses carries a lower bankruptcy risk.
