Data Visualization for Bank Loans
Introduction
You can find the dataset here.
Business Question
The project aims to visualize and analyze a loan dataset to gain insights into borrower profiles and loan status. By leveraging data visualization techniques, we seek to uncover patterns, trends, and relationships within the dataset, providing valuable information for decision-making and understanding the factors influencing loan outcomes.
The dataset consists of various variables related to borrowers,
including loan status, loan amount,
credit score, annual income,
employment history, home ownership, and more.
These variables capture essential aspects of borrowers’ financial
profiles and loan characteristics.
The project will employ RStudio and popular packages such as
dplyr, ggplot2, and tibble for
data wrangling, manipulation, and visualization. By applying exploratory
data analysis techniques, we will identify meaningful insights and
relationships within the data.
Here are the project goals summarized in bullet points:
- Visualize and analyze a loan dataset to gain insights into borrower profiles and loan status.
- Identify patterns, trends, and relationships within the dataset using data visualization techniques.
- Determine the overall distribution of loan statuses in the dataset.
- Analyze the most common loan term (short term or long term) in the dataset.
- Explore the distribution of loan purposes and identify the most frequently occurring purposes.
- Investigate how credit scores vary across different home ownership types.
- Examine the distribution of job stability levels based on different loan purposes.
- Analyze the distribution of loan status based on loan term and credit score categories.
- Investigate the relationship between debt-to-income ratios and loan purposes, as well as job stability categories.
- Determine how home ownership type affects the debt-to-income ratio across different credit score categories.
- Explore the distribution of bankruptcies across different loan purposes and job stability categories.
By achieving these goals, the project aims to provide valuable insights for decision-making in the lending industry, financial planning, and policy development.
Overview Dataset
- Loan.Status: Indicates the status of the loan, whether it is “Fully Paid” or “Charged Off”.
- Current.Loan.Amount: Represents the current loan amount in dollars.
- Term: Specifies the term of the loan, either “Short Term” or “Long Term”.
- Credit.Score: Represents the credit score of the borrower.
- Annual.Income: Indicates the annual income of the borrower in dollars.
- Years.in.current.job: Represents the number of years the borrower has been in their current job.
- Home.Ownership: Specifies the type of home ownership, such as “Home Mortgage”, “Own Home”, or “Rent”.
- Purpose: Indicates the purpose of the loan, such as “Home Improvements” or “Debt Consolidation”.
- Monthly.Debt: Represents the monthly debt payments of the borrower.
- Years.of.Credit.History: Represents the number of years of credit history.
- Number.of.Open.Accounts: Indicates the number of open credit accounts the borrower has.
- Number.of.Credit.Problems: Represents the number of credit problems the borrower has faced.
- Current.Credit.Balance: Represents the current balance on the borrower’s credit accounts.
- Maximum.Open.Credit: Indicates the maximum open credit available to the borrower.
- Bankruptcies: Specifies the number of bankruptcies the borrower has filed.
- Tax.Liens: Represents the number of tax liens the borrower has.
- Debt.To.Income.Ratio: Represents the ratio of debt to income for the borrower.
- Credit.Utilization.Ratio: Indicates the ratio of credit utilized by the borrower.
- Years.numeric: Represents the number of years the borrower has been in their current job (numeric format).
- Job.Stability: Indicates the level of job stability, such as “Stable”, “Moderate”, or “Unstable”.
- Credit.Score.Category: Specifies the credit score category of the borrower, such as “Good” or “Very Good”.
- Loan.Amount.Category: Represents the loan amount category, such as “Low”, “Medium”, or “High”.
Data Preparation
Import Package & Data Set
Import package
library(dplyr) # The dplyr package is used for data wrangling and manipulation. It provides functions like filter(), mutate(), summarize(), and arrange() for efficient data manipulation.
library(tibble) # The tibble package provides an improved version of data frames in R. Tibbles are easier to work with and offer better printing and subsetting behavior.
library(ggplot2) # The ggplot2 package is a powerful tool for creating data visualizations. It is based on the grammar of graphics and offers a wide range of plotting functions and options.
library(igraph) # The igraph package is used for analyzing and visualizing graph data. It provides functions for working with networks and graphs, including measuring properties, detecting communities, and visualizing graphs using different layout algorithms.
Import Dataset
bankLoan_df <- read.csv(file = "data_input/Bank_loan_train.csv")
rmarkdown::paged_table(bankLoan_df)
Checking Structure using glimpse()
The glimpse() function is used to display the
structure of the object. The object can be a vector,
data frame, list, or other R object.
glimpse(bankLoan_df)
#> Rows: 100,514
#> Columns: 19
#> $ Loan.ID <chr> "14dd8831-6af5-400b-83ec-68e61888a048", "…
#> $ Customer.ID <chr> "981165ec-3274-42f5-a3b4-d104041a9ca9", "…
#> $ Loan.Status <chr> "Fully Paid", "Fully Paid", "Fully Paid",…
#> $ Current.Loan.Amount <int> 445412, 262328, 99999999, 347666, 176220,…
#> $ Term <chr> "Short Term", "Short Term", "Short Term",…
#> $ Credit.Score <int> 709, NA, 741, 721, NA, 7290, 730, NA, 678…
#> $ Annual.Income <int> 1167493, NA, 2231892, 806949, NA, 896857,…
#> $ Years.in.current.job <chr> "8 years", "10+ years", "8 years", "3 yea…
#> $ Home.Ownership <chr> "Home Mortgage", "Home Mortgage", "Own Ho…
#> $ Purpose <chr> "Home Improvements", "Debt Consolidation"…
#> $ Monthly.Debt <dbl> 5214.74, 33295.98, 29200.53, 8741.90, 206…
#> $ Years.of.Credit.History <dbl> 17.2, 21.1, 14.9, 12.0, 6.1, 17.3, 19.6, …
#> $ Months.since.last.delinquent <int> NA, 8, 29, NA, NA, NA, 10, 8, 33, NA, 76,…
#> $ Number.of.Open.Accounts <int> 6, 35, 18, 9, 15, 6, 13, 15, 4, 20, 16, 2…
#> $ Number.of.Credit.Problems <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
#> $ Current.Credit.Balance <int> 228190, 229976, 297996, 256329, 253460, 2…
#> $ Maximum.Open.Credit <int> 416746, 850784, 750090, 386958, 427174, 2…
#> $ Bankruptcies <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
#> $ Tax.Liens <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Data Inspection
Explicit Coercion
Before any analysis, the data must be clean. One common cleaning step is converting each column to the appropriate data type, which is known as explicit coercion.
# check the loan data structure again
glimpse(bankLoan_df)
#> Rows: 100,514
#> Columns: 19
#> $ Loan.ID <chr> "14dd8831-6af5-400b-83ec-68e61888a048", "…
#> $ Customer.ID <chr> "981165ec-3274-42f5-a3b4-d104041a9ca9", "…
#> $ Loan.Status <chr> "Fully Paid", "Fully Paid", "Fully Paid",…
#> $ Current.Loan.Amount <int> 445412, 262328, 99999999, 347666, 176220,…
#> $ Term <chr> "Short Term", "Short Term", "Short Term",…
#> $ Credit.Score <int> 709, NA, 741, 721, NA, 7290, 730, NA, 678…
#> $ Annual.Income <int> 1167493, NA, 2231892, 806949, NA, 896857,…
#> $ Years.in.current.job <chr> "8 years", "10+ years", "8 years", "3 yea…
#> $ Home.Ownership <chr> "Home Mortgage", "Home Mortgage", "Own Ho…
#> $ Purpose <chr> "Home Improvements", "Debt Consolidation"…
#> $ Monthly.Debt <dbl> 5214.74, 33295.98, 29200.53, 8741.90, 206…
#> $ Years.of.Credit.History <dbl> 17.2, 21.1, 14.9, 12.0, 6.1, 17.3, 19.6, …
#> $ Months.since.last.delinquent <int> NA, 8, 29, NA, NA, NA, 10, 8, 33, NA, 76,…
#> $ Number.of.Open.Accounts <int> 6, 35, 18, 9, 15, 6, 13, 15, 4, 20, 16, 2…
#> $ Number.of.Credit.Problems <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
#> $ Current.Credit.Balance <int> 228190, 229976, 297996, 256329, 253460, 2…
#> $ Maximum.Open.Credit <int> 416746, 850784, 750090, 386958, 427174, 2…
#> $ Bankruptcies <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
#> $ Tax.Liens <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
To change the data type, we can use the as.___() function, where ___ is replaced by the destination data type. Examples:
- as.character()
- as.Date()
- as.integer()
- as.numeric()
- as.factor()
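As a small illustration of explicit coercion, the sketch below uses made-up vectors rather than the loan data. It also shows one pitfall worth knowing when a numeric-looking column is turned into a factor: calling as.numeric() directly on a factor returns the internal level codes, not the original values.

```r
# Toy vectors for illustration only (not taken from the loan dataset)
status <- c("Fully Paid", "Charged Off", "Fully Paid")
counts <- c("0", "1", "3")

# Character -> factor: the distinct values become levels (sorted alphabetically)
status_fct <- as.factor(status)
print(levels(status_fct))

# Character -> numeric
counts_num <- as.numeric(counts)

# Pitfall: as.numeric() on a factor returns level codes, not the labels
counts_fct <- as.factor(counts)
codes  <- as.numeric(counts_fct)                 # level codes
values <- as.numeric(as.character(counts_fct))   # original numbers
print(data.frame(label = counts, codes = codes, values = values))
```

Going through as.character() first is the usual way to recover the numbers from a factor.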
The columns whose data types I changed are:
- Loan.Status -> Factor
- Term -> Factor
- Home.Ownership -> Factor
- Purpose -> Factor
- Bankruptcies -> Factor
- Credit.Score -> Double
# explicit coercion
bankLoan_df <- bankLoan_df %>%
  mutate(Loan.Status = as.factor(Loan.Status),
         Term = as.factor(Term),
         Home.Ownership = as.factor(Home.Ownership),
         Purpose = as.factor(Purpose),
         Bankruptcies = as.factor(Bankruptcies),
         Credit.Score = as.double(Credit.Score)
  )
glimpse(bankLoan_df)
#> Rows: 100,514
#> Columns: 19
#> $ Loan.ID <chr> "14dd8831-6af5-400b-83ec-68e61888a048", "…
#> $ Customer.ID <chr> "981165ec-3274-42f5-a3b4-d104041a9ca9", "…
#> $ Loan.Status <fct> Fully Paid, Fully Paid, Fully Paid, Fully…
#> $ Current.Loan.Amount <int> 445412, 262328, 99999999, 347666, 176220,…
#> $ Term <fct> Short Term, Short Term, Short Term, Long …
#> $ Credit.Score <dbl> 709, NA, 741, 721, NA, 7290, 730, NA, 678…
#> $ Annual.Income <int> 1167493, NA, 2231892, 806949, NA, 896857,…
#> $ Years.in.current.job <chr> "8 years", "10+ years", "8 years", "3 yea…
#> $ Home.Ownership <fct> Home Mortgage, Home Mortgage, Own Home, O…
#> $ Purpose <fct> Home Improvements, Debt Consolidation, De…
#> $ Monthly.Debt <dbl> 5214.74, 33295.98, 29200.53, 8741.90, 206…
#> $ Years.of.Credit.History <dbl> 17.2, 21.1, 14.9, 12.0, 6.1, 17.3, 19.6, …
#> $ Months.since.last.delinquent <int> NA, 8, 29, NA, NA, NA, 10, 8, 33, NA, 76,…
#> $ Number.of.Open.Accounts <int> 6, 35, 18, 9, 15, 6, 13, 15, 4, 20, 16, 2…
#> $ Number.of.Credit.Problems <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
#> $ Current.Credit.Balance <int> 228190, 229976, 297996, 256329, 253460, 2…
#> $ Maximum.Open.Credit <int> 416746, 850784, 750090, 386958, 427174, 2…
#> $ Bankruptcies <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
#> $ Tax.Liens <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Drop “Loan.ID” and “Customer.ID”, as they are only identification features.
bankLoan_df <- bankLoan_df[, !(names(bankLoan_df) %in% c("Loan.ID", "Customer.ID"))]
Check Missing Value
# Calculate the total and percentage of missing values in each column
missing_data <- data.frame(
Column = names(bankLoan_df),
Total = colSums(is.na(bankLoan_df)),
Percent = colMeans(is.na(bankLoan_df)) * 100
)
# Create a tibble with the missing data information
missing_table <- as_tibble(missing_data)
# Print the missing data table
print(missing_table)
#> # A tibble: 17 × 3
#> Column Total Percent
#> <chr> <dbl> <dbl>
#> 1 Loan.Status 0 0
#> 2 Current.Loan.Amount 514 0.511
#> 3 Term 0 0
#> 4 Credit.Score 19668 19.6
#> 5 Annual.Income 19668 19.6
#> 6 Years.in.current.job 0 0
#> 7 Home.Ownership 0 0
#> 8 Purpose 0 0
#> 9 Monthly.Debt 514 0.511
#> 10 Years.of.Credit.History 514 0.511
#> 11 Months.since.last.delinquent 53655 53.4
#> 12 Number.of.Open.Accounts 514 0.511
#> 13 Number.of.Credit.Problems 514 0.511
#> 14 Current.Credit.Balance 514 0.511
#> 15 Maximum.Open.Credit 516 0.513
#> 16 Bankruptcies 718 0.714
#> 17 Tax.Liens 524 0.521
We have:
- 53.4% missing data in Months.since.last.delinquent.
- 19.6% missing data in both Credit.Score and Annual.Income.
Drop the columns from the “bankLoan_df” dataset that have more than 50% missing values.
# Calculate the percentage of missing values in each column
missing_percent <- colMeans(is.na(bankLoan_df)) * 100
# Identify the column names with more than 50% missing values
columns_to_drop <- names(missing_percent[missing_percent > 50])
# Drop the columns with more than 50% missing values
bankLoan_df <- bankLoan_df[, !(names(bankLoan_df) %in% columns_to_drop)]
Drop rows that contain any NA or missing value
# Drop rows with any NA or missing values
bankLoan_df <- bankLoan_df[complete.cases(bankLoan_df), ]
sum(is.na(bankLoan_df))
#> [1] 0
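The two-step cleaning above (drop sparse columns, then drop incomplete rows) can be tried on a small example. This is a minimal sketch on a hypothetical toy data frame, not the loan data; the 50% threshold mirrors the one used above.

```r
# Hypothetical toy data frame for illustration
toy <- data.frame(
  a = c(1, NA, 3, 4),          # 25% missing
  b = c(NA, NA, NA, 4),        # 75% missing
  c = c("x", "y", "z", NA),    # 25% missing
  stringsAsFactors = FALSE
)

# Step 1: keep only columns with at most 50% missing values
missing_percent <- colMeans(is.na(toy)) * 100
keep <- names(missing_percent)[missing_percent <= 50]
toy_cols <- toy[, keep, drop = FALSE]

# Step 2: keep only rows without any missing value
toy_clean <- toy_cols[complete.cases(toy_cols), , drop = FALSE]
print(toy_clean)
```

Using drop = FALSE keeps the result a data frame even if only one column survives. Doing the column step first matters: dropping rows first would also discard every row where only the sparse column is missing.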
Data Wrangling
Debt-to-Income Ratio
The debt-to-income (DTI) ratio is the percentage of your gross monthly income that goes to paying your monthly debt payments and is used by lenders to determine your borrowing risk.
I calculate it by dividing the monthly debt by the monthly income, that is, the annual income divided by 12.
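As a quick check of the formula, the sketch below applies it to the first row shown in the glimpse() output (Monthly.Debt 5214.74, Annual.Income 1167493). The helper name dti_ratio is hypothetical and used only for illustration.

```r
# Debt-to-income ratio: monthly debt divided by monthly income
# (dti_ratio is a hypothetical helper, not part of the original analysis)
dti_ratio <- function(monthly_debt, annual_income) {
  monthly_debt / (annual_income / 12)
}

# Values taken from the first row of the glimpse() output
dti_first <- dti_ratio(monthly_debt = 5214.74, annual_income = 1167493)
print(round(dti_first, 4))
```

The result is about 0.0536, meaning roughly 5.4% of this borrower's monthly income goes to debt payments.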
# Debt-to-Income Ratio
bankLoan_df$Debt.To.Income.Ratio <- bankLoan_df$Monthly.Debt / (bankLoan_df$Annual.Income / 12)
Credit Utilization Ratio
The credit utilization ratio, generally expressed as a percentage, represents the amount of revolving credit you’re using divided by the total credit available to you. Lenders use your credit utilization ratio to help determine how well you’re managing your current debt.
I calculate it by dividing the current credit balance by the maximum open credit.
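The sketch below checks this calculation on the first row of the glimpse() output and shows what happens when the denominator is zero. Whether the dataset actually contains rows with zero maximum open credit is not shown here, so the zero cases are hypothetical; credit_utilization is likewise an illustrative helper name.

```r
# Credit utilization: current balance divided by maximum open credit
# (credit_utilization is a hypothetical helper for illustration)
credit_utilization <- function(balance, max_credit) {
  balance / max_credit
}

# Values taken from the first row of the glimpse() output
util_first <- credit_utilization(228190, 416746)

# Hypothetical edge cases: a zero denominator does not raise an error in R
util_inf <- credit_utilization(100, 0)  # positive balance, no open credit
util_nan <- credit_utilization(0, 0)    # zero balance, no open credit

# is.finite() flags both edge cases so they can be reviewed before plotting
print(is.finite(c(util_first, util_inf, util_nan)))
```

Checking is.finite() on the new column before plotting helps catch such values, since a single infinite ratio can distort axis scales and means.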
# Credit Utilization Ratio
bankLoan_df$Credit.Utilization.Ratio <- bankLoan_df$Current.Credit.Balance / bankLoan_df$Maximum.Open.Credit
Job Stability
I categorize the years in the current job into three groups based on duration: “Unstable” (up to 2 years), “Moderate” (more than 2 and up to 5 years), and “Stable” (more than 5 years).
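To see how the text values map to the three groups, here is a minimal sketch on a handful of sample strings. The strings follow the format visible in the output (“8 years”, “10+ years”, “< 1 year”); the “n/a” entry is a hypothetical value added to show what happens when no digits are present.

```r
# Sample values; "n/a" is a hypothetical entry with no digits
years_raw <- c("< 1 year", "2 years", "3 years", "5 years",
               "8 years", "10+ years", "n/a")

# Keep only the digits, then convert to numeric (no digits -> NA)
years_num <- as.numeric(gsub("[^0-9]+", "", years_raw))

# cut() uses right-closed intervals by default:
# (-Inf, 2] Unstable, (2, 5] Moderate, (5, Inf] Stable
stability <- cut(years_num,
                 breaks = c(-Inf, 2, 5, Inf),
                 labels = c("Unstable", "Moderate", "Stable"))

print(data.frame(years_raw, years_num, stability))
```

Because the intervals are closed on the right, exactly 2 years counts as “Unstable” and exactly 5 years as “Moderate”. Values that cannot be parsed become NA and are removed later when incomplete rows are dropped.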
# Extract the number of years from the "Years.in.current.job" variable
bankLoan_df$Years.numeric <- as.numeric(gsub("[^0-9]+", "", bankLoan_df$Years.in.current.job))
bankLoan_df$Years.numeric[grepl("\\+ years", bankLoan_df$Years.in.current.job)] <- 10
# Categorize job stability based on the new column
bankLoan_df$Job.Stability <- cut(bankLoan_df$Years.numeric,
                                 breaks = c(-Inf, 2, 5, Inf),
                                 labels = c("Unstable", "Moderate", "Stable"))
bankLoan_df$Job.Stability <- as.factor(bankLoan_df$Job.Stability)
Categorize the Credit Score
Although ranges vary depending on the credit scoring model, generally credit scores from 580 to 669 are considered fair; 670 to 739 are considered good; 740 to 799 are considered very good; and 800 and up are considered excellent.
I create a new column that categorizes the credit score into “Poor,” “Fair,” “Good,” “Very Good,” and “Excellent” based on the “Credit.Score” column, with scores of 579 and below treated as “Poor.”
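The sketch below applies the same ranges to a few sample scores, including 7290, a value that appears in the glimpse() output. Because the last interval is open-ended, any out-of-range score lands in “Excellent”. The rescaling step is only an assumption (that scores above 850, the top of the common 300 to 850 scale, carry an extra trailing zero); the source does not confirm this, so treat it as one possible fix, not the method used in this project.

```r
credit_score_breaks <- c(-Inf, 579, 669, 739, 799, Inf)
credit_score_labels <- c("Poor", "Fair", "Good", "Very Good", "Excellent")

# Hypothetical helper wrapping the cut() call
categorize_score <- function(x) {
  cut(x, breaks = credit_score_breaks, labels = credit_score_labels,
      include.lowest = TRUE)
}

# Sample scores; 7290 is taken from the glimpse() output
scores <- c(579, 580, 709, 741, 800, 7290)
raw_cat <- categorize_score(scores)

# ASSUMPTION (not confirmed by the source): scores above 850 were
# recorded with an extra trailing zero and can be divided by 10
scores_fixed <- ifelse(scores > 850, scores / 10, scores)
fixed_cat <- categorize_score(scores_fixed)

print(data.frame(scores, raw_cat, scores_fixed, fixed_cat))
```

Running summary() on the score column before categorizing is a cheap way to spot such values.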
# Create a new column to categorize the credit scores
credit_score_breaks <- c(-Inf, 579, 669, 739, 799, Inf)
credit_score_labels <- c("Poor", "Fair", "Good", "Very Good", "Excellent")
bankLoan_df$Credit.Score.Category <- cut(bankLoan_df$Credit.Score,
                                         breaks = credit_score_breaks,
                                         labels = credit_score_labels,
                                         include.lowest = TRUE)
bankLoan_df$Credit.Score.Category <- as.factor(bankLoan_df$Credit.Score.Category)
Categorize the Loan Amount
I create a new column that categorizes the loan amount into three ranges to indicate the size of the loan: “Low” (up to $100,000), “Medium” (above $100,000 and up to $1,000,000), and “High” (above $1,000,000).
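A short sketch of how these ranges behave on sample amounts follows. The value 99999999 appears in the glimpse() output and looks like a placeholder and not a real loan amount; that reading is an assumption, and the flagging step is only a suggestion for inspecting such rows.

```r
loan_amount_breaks <- c(0, 100000, 1000000, Inf)
loan_amount_labels <- c("Low", "Medium", "High")

# Sample amounts; 445412 and 99999999 are taken from the glimpse() output
amounts <- c(50000, 100000, 445412, 99999999)

amount_cat <- cut(amounts,
                  breaks = loan_amount_breaks,
                  labels = loan_amount_labels,
                  include.lowest = TRUE)

# ASSUMPTION: 99999999 is a placeholder value, so flag it for review
is_placeholder <- amounts == 99999999

print(data.frame(amounts, amount_cat, is_placeholder))
```

If that assumption holds, the “High” category would mix genuinely large loans with placeholder rows, which is worth checking before interpreting plots that use Loan.Amount.Category.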
# Create a new column to categorize the loan amount
loan_amount_breaks <- c(0, 100000, 1000000, Inf)
loan_amount_labels <- c("Low", "Medium", "High")
bankLoan_df$Loan.Amount.Category <- cut(bankLoan_df$Current.Loan.Amount,
                                        breaks = loan_amount_breaks,
                                        labels = loan_amount_labels,
                                        include.lowest = TRUE)
bankLoan_df$Loan.Amount.Category <- as.factor(bankLoan_df$Loan.Amount.Category)
Drop (again) rows that contain any NA or missing value
# Drop rows with any NA or missing values
bankLoan_df <- bankLoan_df[complete.cases(bankLoan_df), ]
head(bankLoan_df)
#> Loan.Status Current.Loan.Amount Term Credit.Score Annual.Income
#> 1 Fully Paid 445412 Short Term 709 1167493
#> 3 Fully Paid 99999999 Short Term 741 2231892
#> 4 Fully Paid 347666 Long Term 721 806949
#> 6 Charged Off 206602 Short Term 7290 896857
#> 7 Fully Paid 217646 Short Term 730 1184194
#> 9 Fully Paid 548746 Short Term 678 2559110
#> Years.in.current.job Home.Ownership Purpose Monthly.Debt
#> 1 8 years Home Mortgage Home Improvements 5214.74
#> 3 8 years Own Home Debt Consolidation 29200.53
#> 4 3 years Own Home Debt Consolidation 8741.90
#> 6 10+ years Home Mortgage Debt Consolidation 16367.74
#> 7 < 1 year Home Mortgage Debt Consolidation 10855.08
#> 9 2 years Rent Debt Consolidation 18660.28
#> Years.of.Credit.History Number.of.Open.Accounts Number.of.Credit.Problems
#> 1 17.2 6 1
#> 3 14.9 18 1
#> 4 12.0 9 0
#> 6 17.3 6 0
#> 7 19.6 13 1
#> 9 22.6 4 0
#> Current.Credit.Balance Maximum.Open.Credit Bankruptcies Tax.Liens
#> 1 228190 416746 1 0
#> 3 297996 750090 0 0
#> 4 256329 386958 0 0
#> 6 215308 272448 0 0
#> 7 122170 272052 1 0
#> 9 437171 555038 0 0
#> Debt.To.Income.Ratio Credit.Utilization.Ratio Years.numeric Job.Stability
#> 1 0.05359936 0.5475517 8 Stable
#> 3 0.15699969 0.3972803 8 Stable
#> 4 0.12999929 0.6624207 3 Moderate
#> 6 0.21900133 0.7902719 10 Stable
#> 7 0.10999968 0.4490686 1 Unstable
#> 9 0.08750048 0.7876416 2 Unstable
#> Credit.Score.Category Loan.Amount.Category
#> 1 Good Medium
#> 3 Very Good High
#> 4 Good Medium
#> 6 Excellent Medium
#> 7 Good Medium
#> 9 Good Medium
Basic Business Question & Data Plotting
1. Correlation between Columns
# Create a graph object: igraph uses the first two columns (Loan.Status, Term) as edge endpoints
# and stores the remaining columns (Home.Ownership, Purpose) as edge attributes
graph <- graph_from_data_frame(bankLoan_df[, c("Loan.Status", "Term", "Home.Ownership", "Purpose")])
# Plot the graph
plot(graph, vertex.label.dist = 2, vertex.size = 10, vertex.label.cex = 0.8, edge.arrow.size = 0.5)
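Since graph_from_data_frame() reads the first two columns as edge endpoints, the graph links each Loan.Status value to a Term value. The same pairs can be summarized numerically with a contingency table. The sketch below uses a hypothetical toy data frame; it is offered as a complementary view, not as the method used in this project.

```r
# Hypothetical toy data with the same two columns used as edge endpoints
toy <- data.frame(
  Loan.Status = c("Fully Paid", "Fully Paid", "Charged Off",
                  "Fully Paid", "Charged Off"),
  Term = c("Short Term", "Short Term", "Long Term",
           "Long Term", "Short Term")
)

# Count how often each (Loan.Status, Term) pair occurs
edge_counts <- table(toy$Loan.Status, toy$Term)

# Share of each term within a loan status (rows sum to 1)
row_share <- prop.table(edge_counts, margin = 1)

print(edge_counts)
print(round(row_share, 2))
```

With around 100,000 rows, a table like this is much faster to compute and easier to read than a plot with one edge per loan.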
2. Overall distribution of loan statuses in the dataset
Understanding the distribution of loan statuses helps users and related parties, such as banks, gain insight into the overall performance of loans and assess the risk associated with each loan status.
# Create the plot
loanStatusPlot <- ggplot(bankLoan_df, aes(x = Loan.Status, fill = Loan.Status)) +
geom_bar() +
theme_minimal() +
labs(
title = "Distribution of Loan Statuses",
x = "Loan Status",
y = "Count"
)
# Display the plot
loanStatusPlot
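To put numbers behind a bar chart like this, the counts can be turned into shares and a ratio. The sketch below uses a hypothetical toy vector, so its figures are illustrative and not results from the loan data.

```r
# Hypothetical toy vector of loan statuses
status <- c(rep("Fully Paid", 6), rep("Charged Off", 2))

# Counts and shares per status
status_counts <- table(status)
status_share <- prop.table(status_counts)

# How many fully paid loans there are per charged-off loan
paid_to_charged <- status_counts[["Fully Paid"]] / status_counts[["Charged Off"]]

print(status_counts)
print(status_share)
print(paid_to_charged)
```

Applied to bankLoan_df$Loan.Status, the same three lines would give the exact ratio that the plot only shows visually.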
Insight:
- Most loans are fully paid, about three times as many as are charged off.
3. What is the most common loan term (short term or long term) in the dataset?
This question identifies the dominant loan term in the dataset. Knowing the most common loan term helps users and related parties, such as banks, understand the preferred loan duration and tailor their lending strategies accordingly.
# Create a summary table to calculate the count of each loan term
loanTermSummary <- data.frame(table(bankLoan_df$Term))
# Create the plot
loanTermPlot <- ggplot(loanTermSummary, aes(x = Var1, y = Freq, fill = Var1)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(
title = "Distribution of Loan Terms",
x = "Loan Term",
y = "Count",
fill = "Loan Term"
)
# Display the plot
loanTermPlot
Insight:
- Most borrowers take short-term loans, which are almost three times as common as long-term loans.
4. What is the distribution of loan purposes in the dataset, and which purpose appears most frequently?
By analyzing the frequency of different loan purposes, users and related parties, such as banks, can gain insights into the most common reasons why borrowers seek loans. This information can help lenders tailor their loan products and marketing strategies to better meet the needs of borrowers based on their preferred purposes. Additionally, understanding the distribution of loan purposes can provide insights into market trends and consumer behavior related to borrowing.
# Calculate the loan purpose frequency
loan_purpose_freq <- bankLoan_df %>%
count(Purpose, sort = TRUE)
# Create the visualization using ggplot
ggplot(loan_purpose_freq, aes(x = Purpose, y = n, fill = Purpose)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(
title = "Distribution of Loan Purposes",
x = "Loan Purpose",
y = "Frequency"
) +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none"
)
Insight:
- Most borrowers take loans for debt consolidation; even all the other loan purposes combined fall far short of it.
- Very few borrowers take loans for personal purchases or leisure, for example vacations, weddings, trips, cars, houses, and moving.
5. How does the distribution of credit scores vary across different home ownership types?
By analyzing this distribution, users and related parties, such as banks, can assess the creditworthiness and potential risks associated with different home ownership types.
# Create a bar plot to display the distribution of credit score categories by home ownership
ggplot(bankLoan_df, aes(x = Home.Ownership, fill = Credit.Score.Category)) +
geom_bar(position = "fill") +
theme_minimal() +
labs(
title = "Distribution of Credit Score Categories by Home Ownership",
x = "Home Ownership",
y = "Proportion",
fill = "Credit Score Category"
) +
theme(
plot.title = element_text(size = 16, face = "bold"),
legend.position = "right"
)
Insight:
- The credit score category proportions look similar across all home ownership types, which suggests that home ownership has little association with credit score category.
- In most home ownership types, the majority of borrowers fall in the “Good” credit score category.
6. What is the distribution of job stability levels based on different loan purposes?
Understanding this distribution helps users and related parties, such as banks, assess the employment stability of borrowers based on their loan purposes and evaluate the associated risks.
# Create a bar plot to display the distribution of job stability levels by loan purposes
ggplot(bankLoan_df, aes(y = Purpose, fill = Job.Stability)) +
geom_bar(position = "fill") +
theme_minimal() +
labs(
title = "Distribution of Job Stability Levels by Loan Purposes",
x = "Proportion",
y = "Loan Purpose"
) +
theme(
plot.title = element_text(size = 16, face = "bold"),
legend.position = "right"
)
Insight:
- Most borrowers who take loans for Educational Expenses have unstable jobs. Several variables may explain this, and it needs in-depth research.
- Most borrowers who take loans for renewable_energy have stable jobs.
- Otherwise, there is no significant difference in job stability between loan purposes.
7. What is the distribution of loan status (fully paid, charged off) based on the loan term (short term, long term), and how does it vary across different credit score categories?
By understanding the distribution of loan status across different credit score categories and loan terms, banks and related parties can gain insights into the creditworthiness of borrowers and assess the risk associated with different types of loans.
# Create a stacked bar plot to display the distribution of loan status by loan term and credit score categories
ggplot(bankLoan_df, aes(x = Term, fill = Loan.Status)) +
geom_bar(position = "fill") +
facet_wrap(~Credit.Score.Category) +
theme_minimal() +
labs(
title = "Distribution of Loan Status by Loan Term and Credit Score",
x = "Loan Term",
y = "Proportion"
) +
theme(
plot.title = element_text(size = 16, face = "bold"),
legend.position = "bottom"
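)
Insight:
- All borrowers in the “Excellent” credit score category have charged-off loans, while “Fully Paid” dominates the other three categories (“Very Good,” “Good,” and “Fair”). Note that the head() output above shows a credit score of 7290 classified as “Excellent,” so this category may consist of out-of-range scores and the result should be read with caution.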
8. How does the debt-to-income ratio vary for different purposes of the loan (e.g., debt consolidation, home improvements) and across different job stability categories?
This question explores the relationship between the debt-to-income ratio, loan purposes, and job stability. By examining how the debt-to-income ratio differs across loan purposes and job stability categories, banks and related parties can assess the financial health of borrowers and evaluate the potential risks associated with specific loan purposes and job stability levels.
# Create a grouped bar plot of the mean debt-to-income ratio by loan purpose and job stability
# (position = "dodge" places the group means side by side; stacking means would add them up)
ggplot(bankLoan_df, aes(y = Purpose, x = Debt.To.Income.Ratio, fill = Job.Stability)) +
geom_bar(stat = "summary", fun = "mean", position = "dodge") +
theme_minimal() +
labs(
title = "Debt-to-Income Ratio by Loan Purpose and Job Stability",
x = "Debt-to-Income Ratio (Mean)",
y = "Loan Purpose",
fill = "Job Stability"
) +
theme(
plot.title = element_text(size = 16, face = "bold"),
legend.position = "bottom"
)
Insight:
- Some borrowers take loans for vacations and debt consolidation even though they have unstable jobs. The bank needs to be careful with these loans because they may not be repaid on time.
- Borrowers with stable jobs tend to use their loans for essential needs such as debt consolidation, business, and medical bills, not for personal purchases.
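The group means behind a plot like this can also be computed as a table, together with the number of borrowers in each group. The sketch below uses a hypothetical toy data frame with the same column names; the values are made up for illustration.

```r
# Hypothetical toy data with the same column names as bankLoan_df
toy <- data.frame(
  Purpose = c("Debt Consolidation", "Debt Consolidation",
              "Debt Consolidation", "Vacation", "Vacation"),
  Job.Stability = c("Stable", "Stable", "Unstable", "Unstable", "Unstable"),
  Debt.To.Income.Ratio = c(0.10, 0.20, 0.30, 0.25, 0.35)
)

# Mean debt-to-income ratio per (Purpose, Job.Stability) group
mean_dti <- aggregate(Debt.To.Income.Ratio ~ Purpose + Job.Stability,
                      data = toy, FUN = mean)

# Number of borrowers per group, to judge how reliable each mean is
group_n <- aggregate(Debt.To.Income.Ratio ~ Purpose + Job.Stability,
                     data = toy, FUN = length)
names(group_n)[3] <- "n"

print(merge(mean_dti, group_n))
```

Looking at n alongside the mean helps avoid reading too much into groups that contain only a handful of loans.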
9. Does home ownership type affect the debt-to-income ratio, and if so, how does it differ across different credit score categories?
This question examines the relationship between home ownership type, debt-to-income ratio, and credit score categories. By analyzing this relationship, banks and related parties can understand the impact of home ownership on borrowers’ debt-to-income ratio and assess how it varies across different credit score categories, providing insights into borrowers’ financial stability.
# Create a box plot to compare the debt-to-income ratio across different home ownership types and credit score categories
ggplot(bankLoan_df, aes(x = Home.Ownership, y = Debt.To.Income.Ratio, fill = Credit.Score.Category)) +
geom_boxplot() +
theme_minimal() +
labs(
title = "Debt-to-Income Ratio by Home Ownership Type and Credit Score Category",
x = "Home Ownership",
y = "Debt-to-Income Ratio",
fill = "Credit Score Category"
) +
theme(
plot.title = element_text(size = 16, face = "bold"),
legend.position = "bottom"
)
Insight:
- The “have mortgage” home ownership type shows no outliers in any credit score category.
- Borrowers with a “Very Good” credit score and a home mortgage show the most extreme outliers.
- The debt-to-income ratio differs only moderately between home ownership types, and it does not differ much across credit score categories.
10. How does the distribution of bankruptcies vary across different loan purposes and job stability categories?
By analyzing this relationship, banks and related parties can assess the impact of loan purposes and job stability on borrowers’ financial distress and evaluate the potential risk associated with specific loan purposes and job stability levels.
# Create a stacked bar plot to compare the distribution of bankruptcies based on loan purposes and job stability categories
ggplot(bankLoan_df, aes(y = Purpose, fill = Bankruptcies)) +
geom_bar(position = "fill") +
facet_wrap(~ Job.Stability) +
theme_minimal() +
labs(
title = "Distribution of Bankruptcies by Loan Purposes and Job Stability",
x = "Proportion",
y = "Loan Purposes",
fill = "Bankruptcies"
) +
theme(
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom"
)
Insight:
- Borrowers with unstable jobs who use their loan for small_business tend to have the most bankruptcies. Even though this is only a small portion, the bank should still take it into account.
- Borrowers who use their loan for educational expenses have never experienced bankruptcy, which suggests that lending for educational expenses carries a lower bankruptcy risk in this dataset.
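A proportion plot hides how many borrowers stand behind each bar, so a zero-bankruptcy share in a small group is weaker evidence than the same share in a large one. The sketch below, on a hypothetical toy data frame, shows one way to check both the shares and the group sizes.

```r
# Hypothetical toy data with the same column names as bankLoan_df
toy <- data.frame(
  Purpose = c(rep("Debt Consolidation", 8), rep("Educational Expenses", 2)),
  Bankruptcies = factor(c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0))
)

# Counts per (Purpose, Bankruptcies) pair
counts <- table(toy$Purpose, toy$Bankruptcies)

# Share of each bankruptcy count within a purpose (rows sum to 1)
share <- prop.table(counts, margin = 1)

# Number of borrowers per purpose
group_size <- rowSums(counts)

print(round(share, 2))
print(group_size)
```

In this toy example, “Educational Expenses” shows no bankruptcies, but with only two borrowers that says little about the purpose itself.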