library(tidyverse)
library(dplyr)
library(ggplot2)
library(reshape2)
library(corrplot)
library(randomForest)
library(caret)
library(pROC)
app_data <- read.csv("application_record.csv")
credit_data <- read.csv("credit_record.csv") ## --- Application Record Structure ---
## [1] "ID" "CODE_GENDER" "FLAG_OWN_CAR"
## [4] "FLAG_OWN_REALTY" "CNT_CHILDREN" "AMT_INCOME_TOTAL"
## [7] "NAME_INCOME_TYPE" "NAME_EDUCATION_TYPE" "NAME_FAMILY_STATUS"
## [10] "NAME_HOUSING_TYPE" "DAYS_BIRTH" "DAYS_EMPLOYED"
## [13] "FLAG_MOBIL" "FLAG_WORK_PHONE" "FLAG_PHONE"
## [16] "FLAG_EMAIL" "OCCUPATION_TYPE" "CNT_FAM_MEMBERS"
## --- Credit Record Structure ---
## [1] "ID" "MONTHS_BALANCE" "STATUS"
## 'data.frame': 438557 obs. of 18 variables:
## $ ID : int 5008804 5008805 5008806 5008808 5008809 5008810 5008811 5008812 5008813 5008814 ...
## $ CODE_GENDER : chr "M" "M" "M" "F" ...
## $ FLAG_OWN_CAR : chr "Y" "Y" "Y" "N" ...
## $ FLAG_OWN_REALTY : chr "Y" "Y" "Y" "Y" ...
## $ CNT_CHILDREN : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AMT_INCOME_TOTAL : num 427500 427500 112500 270000 270000 ...
## $ NAME_INCOME_TYPE : chr "Working" "Working" "Working" "Commercial associate" ...
## $ NAME_EDUCATION_TYPE: chr "Higher education" "Higher education" "Secondary / secondary special" "Secondary / secondary special" ...
## $ NAME_FAMILY_STATUS : chr "Civil marriage" "Civil marriage" "Married" "Single / not married" ...
## $ NAME_HOUSING_TYPE : chr "Rented apartment" "Rented apartment" "House / apartment" "House / apartment" ...
## $ DAYS_BIRTH : int -12005 -12005 -21474 -19110 -19110 -19110 -19110 -22464 -22464 -22464 ...
## $ DAYS_EMPLOYED : int -4542 -4542 -1134 -3051 -3051 -3051 -3051 365243 365243 365243 ...
## $ FLAG_MOBIL : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FLAG_WORK_PHONE : int 1 1 0 0 0 0 0 0 0 0 ...
## $ FLAG_PHONE : int 0 0 0 1 1 1 1 0 0 0 ...
## $ FLAG_EMAIL : int 0 0 0 1 1 1 1 0 0 0 ...
## $ OCCUPATION_TYPE : chr "" "" "Security staff" "Sales staff" ...
## $ CNT_FAM_MEMBERS : num 2 2 2 1 1 1 1 1 1 1 ...
## 'data.frame': 1048575 obs. of 3 variables:
## $ ID : int 5001711 5001711 5001711 5001711 5001712 5001712 5001712 5001712 5001712 5001712 ...
## $ MONTHS_BALANCE: int 0 -1 -2 -3 0 -1 -2 -3 -4 -5 ...
## $ STATUS : chr "X" "0" "0" "0" ...
Interpretation: The initial exploration of the datasets reveals a relational structure between the applicant demographics and their credit history.
Application Record: This dataset contains 438,557 records with 18 variables. It includes categorical data such as Gender, Education, and Housing type, alongside numerical data like Income and Family member counts.
Credit Record: The dataset contains 1,048,575
records with 3 specific variables. It includes numerical data such as
MONTHS_BALANCE and categorical data in the
STATUS column.
OCCUPATION_TYPE?## ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY
## 0 0 0 0
## CNT_CHILDREN AMT_INCOME_TOTAL NAME_INCOME_TYPE NAME_EDUCATION_TYPE
## 0 0 0 0
## NAME_FAMILY_STATUS NAME_HOUSING_TYPE DAYS_BIRTH DAYS_EMPLOYED
## 0 0 0 0
## FLAG_MOBIL FLAG_WORK_PHONE FLAG_PHONE FLAG_EMAIL
## 0 0 0 0
## OCCUPATION_TYPE CNT_FAM_MEMBERS
## 0 0
## ID MONTHS_BALANCE STATUS
## 0 0 0
# Calculating the percentage of missing values in `OCCUPATION_TYPE`
occ_missing_count <- sum(app_data$OCCUPATION_TYPE == "" | is.na(app_data$OCCUPATION_TYPE))
occ_missing_percentage <- (occ_missing_count/nrow(app_data)) *100
cat("No. of misssing values:",occ_missing_count)## No. of misssing values: 134203
##
## Percentage of missing values: 30.60104
OCCUPATION_TYPE column contains a significant number of
empty strings (““).Approximately 30.62% of applicants have not specified
their occupation.# 1. Unique IDs
unique_app_ids <- n_distinct(app_data$ID)
unique_credit_ids <- n_distinct(credit_data$ID)
cat("Unique IDs in Application Record:", unique_app_ids, "\n")## Unique IDs in Application Record: 438510
## Unique IDs in Credit Record: 45985
# 2. Overlapping IDs
common_ids <- length(intersect(app_data$ID, credit_data$ID))
cat("Common IDs :", common_ids)## Common IDs : 36457
STATUS column, and which
one occurs most frequently?##
## 0 1 2 3 4 5 C X
## 383120 11090 868 320 223 1693 442031 209230
most_common_status <- names(which.max(status_counts))
most_common_value <- max(status_counts)
cat("\nThe most frequently category status:",most_common_status,"\n")##
## The most frequently category status: C
## The most common values: 442031
STATUS is the most
important variable of our credit risk analysis its define the payment
behaviour of each amount.The column have 8 distinct
categories:0,1,2,3,4,5,C and X.duplicate_ids_count <- sum(duplicated(app_data$ID))
cat("Total duplicate CUstomer IDs found:",duplicate_ids_count)## Total duplicate CUstomer IDs found: 47
## [1] 438510
distinct() function to
remove these redundant entries.After the cleaning process, the record
count decreased from 438,557 to 438,510 unique
entries.This step ensures a “One ID = One Customer”relationship. Establishing a unique primary key is essential before we merge datasets or begin the predictive modeling phase, as it ensures that each applicant’s profile is weighed accurately.
AMT_INCOME_TOTAL for wealth-ter analysis.## ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN
## 1 5987963 M Y N 0
## 2 5987964 M Y N 0
## 3 5987966 M Y N 0
## 4 5987967 M Y N 0
## 5 5987968 M Y N 0
## 6 5987969 M Y N 0
## 7 7987964 M Y N 0
## 8 6123707 M Y Y 0
## 9 6123708 M Y Y 0
## 10 6123709 M Y Y 0
## AMT_INCOME_TOTAL NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS
## 1 6750000 Working Higher education Married
## 2 6750000 Working Higher education Married
## 3 6750000 Working Higher education Married
## 4 6750000 Working Higher education Married
## 5 6750000 Working Higher education Married
## 6 6750000 Working Higher education Married
## 7 6750000 Working Higher education Married
## 8 4500000 Working Higher education Married
## 9 4500000 Working Higher education Married
## 10 4500000 Working Higher education Married
## NAME_HOUSING_TYPE DAYS_BIRTH DAYS_EMPLOYED FLAG_MOBIL FLAG_WORK_PHONE
## 1 House / apartment -19341 -443 1 1
## 2 House / apartment -19341 -443 1 1
## 3 House / apartment -19341 -443 1 1
## 4 House / apartment -19341 -443 1 1
## 5 House / apartment -19341 -443 1 1
## 6 House / apartment -19341 -443 1 1
## 7 House / apartment -19341 -443 1 1
## 8 House / apartment -18784 -3618 1 1
## 9 House / apartment -18784 -3618 1 1
## 10 House / apartment -18784 -3618 1 1
## FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS
## 1 1 0 Laborers 2
## 2 1 0 Laborers 2
## 3 1 0 Laborers 2
## 4 1 0 Laborers 2
## 5 1 0 Laborers 2
## 6 1 0 Laborers 2
## 7 1 0 Laborers 2
## 8 0 0 Managers 2
## 9 0 0 Managers 2
## 10 0 0 Managers 2
# Applying Multiple conditions
# flag_own_car == 'Y' # Car owner
# flag_own_realty == 'Y' # House Owner
# cnt_children > 2 (more than two children)
high_asset_familes <- app_data %>%
filter(FLAG_OWN_CAR == "Y" & FLAG_OWN_REALTY == 'Y' & CNT_CHILDREN >2)
cat("TOtal high asset families found:", nrow(high_asset_familes))## TOtal high asset families found: 2069
head(high_asset_familes[, c("ID","FLAG_OWN_CAR", "FLAG_OWN_REALTY", "CNT_CHILDREN", "AMT_INCOME_TOTAL")])## ID FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL
## 1 5008836 Y Y 3 270000
## 2 5008837 Y Y 3 270000
## 3 5021339 Y Y 3 270000
## 4 5021340 Y Y 3 270000
## 5 5021341 Y Y 3 270000
## 6 5021342 Y Y 3 270000
"Pensioners" who are currently
living in “Rented apartments” to assess housing
stability.pensioners_rented <- app_data %>%
filter(NAME_INCOME_TYPE == 'Pensioner' & NAME_HOUSING_TYPE == "Rented apartment")
head(pensioners_rented)## ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN
## 1 5009033 F N N 0
## 2 5009034 F N N 0
## 3 5009035 F N N 0
## 4 5009036 F N N 0
## 5 5009037 F N N 0
## 6 5009038 F N N 0
## AMT_INCOME_TOTAL NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS
## 1 255150 Pensioner Incomplete higher Civil marriage
## 2 255150 Pensioner Incomplete higher Civil marriage
## 3 255150 Pensioner Incomplete higher Civil marriage
## 4 255150 Pensioner Incomplete higher Civil marriage
## 5 255150 Pensioner Incomplete higher Civil marriage
## 6 255150 Pensioner Incomplete higher Civil marriage
## NAME_HOUSING_TYPE DAYS_BIRTH DAYS_EMPLOYED FLAG_MOBIL FLAG_WORK_PHONE
## 1 Rented apartment -18682 365243 1 0
## 2 Rented apartment -18682 365243 1 0
## 3 Rented apartment -18682 365243 1 0
## 4 Rented apartment -18682 365243 1 0
## 5 Rented apartment -18682 365243 1 0
## 6 Rented apartment -18682 365243 1 0
## FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS
## 1 0 0 2
## 2 0 0 2
## 3 0 0 2
## 4 0 0 2
## 5 0 0 2
## 6 0 0 2
total_pensioners_rented <- nrow(pensioners_rented)
cat("Total Pensioners in Rented Housing:",total_pensioners_rented)## Total Pensioners in Rented Housing: 289
## ID NAME_INCOME_TYPE NAME_HOUSING_TYPE AMT_INCOME_TOTAL
## 1 5009033 Pensioner Rented apartment 255150
## 2 5009034 Pensioner Rented apartment 255150
## 3 5009035 Pensioner Rented apartment 255150
## 4 5009036 Pensioner Rented apartment 255150
## 5 5009037 Pensioner Rented apartment 255150
## 6 5009038 Pensioner Rented apartment 255150
"Status 5" indicating a serious default of over 150
days.# Status 5 means critical default
high_risk_defaults <- credit_data %>%
filter(STATUS == '5')
total_high_risk_records <- nrow(high_risk_defaults)
unique_defaulters_count <- length(unique(high_risk_defaults$ID))
cat("Total entries with status 5: ", total_high_risk_records)## Total entries with status 5: 1693
##
## Actual unique customers in default: 195
# Employment history filtering
# 5 year = 1825 days
stable_employees <- app_data %>%
filter(DAYS_EMPLOYED < (-1825))
total_stable_applicants <- nrow(stable_employees)
cat("Total number of applicants with >5 year of employment:",total_stable_applicants)## Total number of applicants with >5 year of employment: 188948
## ID DAYS_EMPLOYED NAME_INCOME_TYPE AMT_INCOME_TOTAL
## 1 5008804 -4542 Working 427500
## 2 5008805 -4542 Working 427500
## 3 5008808 -3051 Commercial associate 270000
## 4 5008809 -3051 Commercial associate 270000
## 5 5008810 -3051 Commercial associate 270000
## 6 5008811 -3051 Commercial associate 270000
Applicants in this group are generally considered lower risk because a long tenure at a job suggests a steady income stream and a lower probability of sudden unemployment.
**Detail Report**
The initial phases of our analysis successfully established the
data’s structural integrity and customer segmentation, starting with the
removal of 47 duplicate records to ensure high data
quality. Through Level 2 filtering, we identified high-value segments,
such as the Top 10 high-income earners and asset-rich applicants
(Car/Home owners), while simultaneously isolating
"Status 5" critical defaulters to form the foundation of
our risk assessment. Our technical audit revealed a professionally
stable applicant pool, with a significant portion possessing over 5
years of experience. By decoding the negative values in the
DAYS_EMPLOYED column as time-offsets from the application
date, we have prepared the dataset for accurate feature engineering and
predictive credit scoring.
app_data <- app_data %>%
mutate(Age=round(abs(DAYS_BIRTH)/365,0))
# Grouping by education and calculating mean Income and Age
edu_profile<-app_data %>%
group_by(NAME_EDUCATION_TYPE) %>%
summarise(
Average_Income = round(mean(AMT_INCOME_TOTAL,na.rm = TRUE),2),
Average_Age = round(mean(Age,na.rm=TRUE),1),
Total_Applicants = n()
) %>%
arrange(desc(Average_Income))
print(edu_profile)## # A tibble: 5 × 4
## NAME_EDUCATION_TYPE Average_Income Average_Age Total_Applicants
## <chr> <dbl> <dbl> <int>
## 1 Academic degree 240692. 44.9 312
## 2 Higher education 226110. 41.5 117512
## 3 Incomplete higher 207330. 35.4 14847
## 4 Secondary / secondary special 172057. 45.1 301788
## 5 Lower secondary 143934. 48.2 4051
outliers_check <- app_data %>%
group_by(NAME_INCOME_TYPE) %>%
summarise(
Min_Days = min(DAYS_EMPLOYED),
Max_Days = max(DAYS_EMPLOYED),
Avg_Days = mean(DAYS_EMPLOYED)
)
cat("Outliers are :","\n")## Outliers are :
## # A tibble: 5 × 4
## NAME_INCOME_TYPE Min_Days Max_Days Avg_Days
## <chr> <int> <int> <dbl>
## 1 Commercial associate -16495 -12 -2313.
## 2 Pensioner -11662 365243 364443.
## 3 State servant -16767 -16 -3569.
## 4 Student -3904 -382 -2468.
## 5 Working -17531 -12 -2610.
DAYS_EMPLOYED column, specifically tied to the Pensioner
category.this value suggests over 1,000 years of employment which is
impossible.app_data <- app_data %>%
mutate(Years_of_Experience = ifelse(DAYS_EMPLOYED > 0 ,0 ,abs(DAYS_EMPLOYED)/365))
employment_analysis <- app_data %>%
group_by(NAME_INCOME_TYPE) %>%
summarise(
Application_Count = n(),
Average_Years_Experience = round(mean(Years_of_Experience,na.rm = TRUE),1)) %>%
arrange(desc(Application_Count))
print(employment_analysis)## # A tibble: 5 × 3
## NAME_INCOME_TYPE Application_Count Average_Years_Experience
## <chr> <int> <dbl>
## 1 Working 226076 7.2
## 2 Commercial associate 100744 6.3
## 3 Pensioner 75488 0
## 4 State servant 36185 9.8
## 5 Student 17 6.8
Income_per_Member feature to evaluate the actual disposable
income available per person in the household.app_data <- app_data %>%
mutate(Income_per_Member = AMT_INCOME_TOTAL / CNT_FAM_MEMBERS)
family_financial_summary <- app_data %>%
group_by(NAME_FAMILY_STATUS) %>%
summarise(
Median_Total_Income = median(AMT_INCOME_TOTAL,na.rm = TRUE),
Average_Income_Per_Member = round(mean(Income_per_Member,na.rm=TRUE),2),
Applicant_Count = n()
) %>%
arrange(desc(Average_Income_Per_Member))
print(family_financial_summary)## # A tibble: 5 × 4
## NAME_FAMILY_STATUS Median_Total_Income Average_Income_Per_M…¹ Applicant_Count
## <chr> <dbl> <dbl> <int>
## 1 Single / not marri… 180000 175225. 55258
## 2 Separated 180000 167838. 27251
## 3 Widow 157500 162607. 19674
## 4 Civil marriage 180000 85780. 36529
## 5 Married 157500 79668. 299798
## # ℹ abbreviated name: ¹Average_Income_Per_Member
Is_Bad target column (1 for high-risk, 0 for
healthy) using the credit_record status history to prepare the data for
predictive modeling.credit_labels <- credit_data %>%
mutate(Is_Bad_Flag = ifelse(STATUS %in% c("1", "2", "3", "4", "5"),1,0)) %>%
group_by(ID) %>%
summarise(Is_Bad = max(Is_Bad_Flag))
summary_table <- table(credit_labels$Is_Bad)
print(summary_table)##
## 0 1
## 40635 5350
housing_profile <- app_data %>%
group_by(NAME_HOUSING_TYPE) %>%
summarise(Avg_Age = round(mean(Age, na.rm = TRUE), 1))
print(housing_profile)## # A tibble: 6 × 2
## NAME_HOUSING_TYPE Avg_Age
## <chr> <dbl>
## 1 Co-op apartment 39.3
## 2 House / apartment 44.5
## 3 Municipal apartment 45.5
## 4 Office apartment 40.2
## 5 Rented apartment 37
## 6 With parents 32.6
# Removing those who don't have a credit record
master_data <- app_data %>%
left_join(credit_labels, by = "ID") %>%
filter(!is.na(Is_Bad))
# This tells us if our data is 'Imbalanced' (more Good than Bad)
final_summary <- master_data %>%
group_by(Is_Bad) %>%
summarise(
Total_Count = n(),
Percentage = round((n() / nrow(master_data)) * 100, 2)
)
print(final_summary)## # A tibble: 2 × 3
## Is_Bad Total_Count Percentage
## <dbl> <int> <dbl>
## 1 0 32166 88.2
## 2 1 4291 11.8
## [1] "ID" "CODE_GENDER" "FLAG_OWN_CAR"
## [4] "FLAG_OWN_REALTY" "CNT_CHILDREN" "AMT_INCOME_TOTAL"
## [7] "NAME_INCOME_TYPE" "NAME_EDUCATION_TYPE" "NAME_FAMILY_STATUS"
## [10] "NAME_HOUSING_TYPE" "DAYS_BIRTH" "DAYS_EMPLOYED"
## [13] "FLAG_MOBIL" "FLAG_WORK_PHONE" "FLAG_PHONE"
## [16] "FLAG_EMAIL" "OCCUPATION_TYPE" "CNT_FAM_MEMBERS"
## [19] "Age" "Years_of_Experience" "Income_per_Member"
## [22] "Is_Bad"
## ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN
## 1 5008804 M Y Y 0
## 2 5008805 M Y Y 0
## 3 5008806 M Y Y 0
## 4 5008808 F N Y 0
## 5 5008809 F N Y 0
## 6 5008810 F N Y 0
## AMT_INCOME_TOTAL NAME_INCOME_TYPE NAME_EDUCATION_TYPE
## 1 427500 Working Higher education
## 2 427500 Working Higher education
## 3 112500 Working Secondary / secondary special
## 4 270000 Commercial associate Secondary / secondary special
## 5 270000 Commercial associate Secondary / secondary special
## 6 270000 Commercial associate Secondary / secondary special
## NAME_FAMILY_STATUS NAME_HOUSING_TYPE DAYS_BIRTH DAYS_EMPLOYED FLAG_MOBIL
## 1 Civil marriage Rented apartment -12005 -4542 1
## 2 Civil marriage Rented apartment -12005 -4542 1
## 3 Married House / apartment -21474 -1134 1
## 4 Single / not married House / apartment -19110 -3051 1
## 5 Single / not married House / apartment -19110 -3051 1
## 6 Single / not married House / apartment -19110 -3051 1
## FLAG_WORK_PHONE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS Age
## 1 1 0 0 2 33
## 2 1 0 0 2 33
## 3 0 0 0 Security staff 2 59
## 4 0 1 1 Sales staff 1 52
## 5 0 1 1 Sales staff 1 52
## 6 0 1 1 Sales staff 1 52
## Years_of_Experience Income_per_Member Is_Bad
## 1 12.443836 213750 1
## 2 12.443836 213750 1
## 3 3.106849 56250 0
## 4 8.358904 270000 0
## 5 8.358904 270000 0
## 6 8.358904 270000 0
Is_Bad target variable. This
analysis is essential to identify Class Imbalance, which determines the
baseline accuracy required for future predictive modeling.ggplot(master_data, aes(x = as.factor(Is_Bad), fill = as.factor(Is_Bad))) +
geom_bar(color = "black", width = 0.6) +
# Adding count labels on top of bars
geom_text(stat='count', aes(label=..count..), vjust=-0.3, size=4) +
scale_fill_manual(values = c("0" = "skyblue", "1" = "orange"),
labels = c("Healthy (0)", "Risky (1)")) +
labs(title = "Visualizing Class Imbalance in Applicant Data",
x = "Credit Category (0 = Good, 1 = Bad)",
y = "Number of Applicants",
fill = "Legend") +
theme_minimal()## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Creating the Age Histogram with KDE Curve
ggplot(master_data, aes(x = Age)) +
geom_histogram(aes(y = ..density..), binwidth = 2, fill = "lightgreen", color = "white") +
geom_density(alpha = 0.3, color = "red") +
# Reference line for Average Age
geom_vline(aes(xintercept = mean(Age)), color = "black", linetype = "dashed", size = 1) +
labs(title = "Applicant Age Distribution",
subtitle = "Identifying the Peak Age Group for Credit Seeking",
x = "Age (Years)",
y = "Density",
caption = "Black line represents the Average Age") +
theme_classic()## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Task: Create a categorical boxplot comparing
AMT_INCOME_TOTAL across different
NAME_HOUSING_TYPE categories. Utilize a log-scale to
normalize high-income variations and identify significant financial
outliers. Objective: To investigate the relationship
between housing stability and income levels.
# 3. Boxplot: Income vs Housing Type
ggplot(master_data, aes(x = NAME_HOUSING_TYPE, y = AMT_INCOME_TOTAL, fill = NAME_HOUSING_TYPE)) +
geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 2) +
# Using Log scale to normalize the huge income gaps
scale_y_log10(labels = function(x) format(x, scientific = FALSE, big.mark = ",")) +
# Using simple, distinct color names for clarity
scale_fill_manual(values = c("House / apartment" = "skyblue",
"With parents" = "gold",
"Municipal apartment" = "lightpink",
"Rented apartment" = "lightgreen",
"Office apartment" = "orange",
"Co-op apartment" = "plum")) +
labs(title = "Income Distribution by Housing Type",
subtitle = "Comparing Financial Strength across Living Situations",
x = "Housing Category",
y = "Total Income (Log Scale)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), # Rotating names for neatness
legend.position = "none") # Removing legend since X-axis labels are enough"House / apartments" and "Office apartments" generally
maintain a higher and more stable income flow.The graph also highlights a significant wealth gap; while the core
income ranges overlap, the maximum values extend well into the higher
brackets. The red dots above the boxes are
Outliers, representing high-net-worth individuals whose
incomes are exceptionally high compared to their peers
high incomes, they contain critical business
insights about top-tier customers. We only remove them
if they are caused by data entry errors or system malfunctions.Professional Tenure vs. Earning Capacity:
Task: Develop a scatter plot to investigate the correlation between
Years_of_Experience and Total_Income. Include
a regression trend line to visualize how job stability influences the
financial growth of an applicant.Objective: To validate if professional experience is a statistically significant driver of income.
# 4. Scatter Plot: Experience vs. Income
ggplot(master_data, aes(x = Years_of_Experience, y = AMT_INCOME_TOTAL)) +
# Adding points with transparency to handle overlapping data
geom_point(alpha = 0.3, color = "skyblue") +
# Adding a professional trend line (Regression Line)
geom_smooth(method = "lm", color = "red", se = TRUE) +
# Applying Log scale with standard, readable labels
scale_y_log10(labels = function(x) format(x, scientific = FALSE, big.mark = ",")) +
labs(title = "Professional Tenure vs. Earning Capacity",
subtitle = "Analyzing the Correlation between Experience and Income",
x = "Years of Experience",
y = "Total Income ") +
theme_minimal()## `geom_smooth()` using formula = 'y ~ x'
Objective: To identify which demographic factors have the strongest linear relationship with credit default risk.
# Compute the correlation matrix for numerical variables
corr_matrix <- cor(master_data[, c("Age", "AMT_INCOME_TOTAL", "CNT_FAM_MEMBERS", "Years_of_Experience", "Is_Bad")], use = "complete.obs")
# Melt the correlation matrix for ggplot2
melted_cormat <- melt(corr_matrix)
# Plotting the Heatmap with custom colors
ggplot(data = melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
# Displays the correlation values on the boxes
geom_text(aes(label = round(value, 2)), color = "black", size = 3.5) +
# Using a custom divergent scale: Purple (low), White (mid), Orange (high)
scale_fill_gradient2(low = "#512c3f", mid = "white", high = "#d87b3e",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(title = "Multivariate Feature Interaction Heatmap",
subtitle = "Linear relationships among numerical features and target",
x = "",
y = "")Is_Bad, we
can identify which demographic and financial factors, such as
income level or professional experience, exhibit the
strongest linear correlation with credit default risk. This critical
diagnostic step ensures that our predictive models are built on
relevant, non-redundant features, ultimately enhancing the reliability
and accuracy of our credit risk assessment.AMT_INCOME_TOTAL on the
y-axis, with the color aesthetic mapped to the target variable Is_Bad.
Add a simple trend line for each class to visualize how the healthy and
risky profiles overlap or separate.Objective: To determine if the classes are linearly separable or if they overlap significantly, which guides the choice between a linear model (like Logistic Regression) or an advanced tree-based model (like Random Forest)
ggplot(master_data, aes(x = Age, y = AMT_INCOME_TOTAL)) +
# Points for Healthy (0) and Risky (1) with mapped colors
geom_point(aes(color = "Healthy (Points)", alpha = "Healthy (Points)"),
data = subset(master_data, Is_Bad == 0), size = 1.5) +
geom_point(aes(color = "Risky (Points)", alpha = "Risky (Points)"),
data = subset(master_data, Is_Bad == 1), size = 1.5) +
# Trend lines for Healthy (0) and Risky (1) with mapped colors
geom_smooth(aes(color = "Healthy Trend"),
data = subset(master_data, Is_Bad == 0),
method = "lm", se = FALSE, linewidth = 1.2) +
geom_smooth(aes(color = "Risky Trend"),
data = subset(master_data, Is_Bad == 1),
method = "lm", se = FALSE, linewidth = 1.2) +
# Custom colors and alpha mapping for the unified legend
scale_color_manual(name = "Legend",
values = c("Healthy (Points)" = "blue",
"Risky (Points)" = "red",
"Healthy Trend" = "green",
"Risky Trend" = "orange")) +
scale_alpha_manual(name = "Legend",
values = c("Healthy (Points)" = 0.3,
"Risky (Points)" = 0.3,
"Healthy Trend" = 1.0,
"Risky Trend" = 1.0),
guide = "none") + # Hides alpha in the legend so it looks clean
# Formatting Y-axis to standard numbers instead of scientific notation
scale_y_log10(labels = function(x) format(x, scientific = FALSE, big.mark = ",")) +
labs(title = "Class Separation Analysis: Age vs Income",
subtitle = "Inspecting how target classes separate across key features",
x = "Age",
y = "Total Income (Log Scale)") +
theme_minimal() +
theme(legend.position = "right",
legend.title = element_text(face = "bold"))## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
ggplot(master_data, aes(x = Years_of_Experience, y = AMT_INCOME_TOTAL, color = factor(Is_Bad))) +
# Scatter points with slight transparency to handle overlapping data
geom_point(alpha = 0.4, size = 1.5) +
# Set distinct and professional colors: Blue for Healthy, Red for Risky
scale_color_manual(values = c("0" = "#1F77B4", "1" = "#D62728"),
name = "Applicant Risk",
labels = c("0" = "Healthy", "1" = "Risky")) +
# Formatting Y-axis to standard numbers instead of scientific notation
scale_y_log10(labels = function(x) format(x, scientific = FALSE, big.mark = ",")) +
labs(title = "Feature Interaction: Experience vs Income",
subtitle = "Analyzing the concentration of risk across experience levels",
x = "Years of Experience",
y = "Total Income (Log Scale)",
color = "Applicant Risk") +
theme_minimal() +
theme(legend.position = "bottom",
legend.title = element_text(face = "bold"))AMT_INCOME_TOTAL
between the healthy and risky accounts to check where the risk profile
lies.ggplot(master_data, aes(x = AMT_INCOME_TOTAL, fill = factor(Is_Bad))) +
# Density plot with transparency
geom_density(alpha = 0.5) +
scale_x_log10(labels = function(x) format(x, scientific = FALSE, big.mark = ",")) +
scale_fill_manual(values = c("0" = "#1F77B4", "1" = "#D62728"),
name = "Applicant Risk",
labels = c("0" = "Healthy", "1" = "Risky")) +
labs(title = "Income Density Distribution by Risk Status",
subtitle = "Comparing data spread between healthy and risky segments",
x = "Total Income (Log Scale)",
y = "Density",
fill = "Applicant Risk") +
theme_minimal() +
theme(legend.position = "bottom",
legend.title = element_text(face = "bold"))Age and AMT_INCOME_TOTAL) and create a stacked bar chart
to show the proportion of healthy vs. risky applicants within each
cluster group.# 1. Run K-Means Clustering on raw features
set.seed(42)
cluster_result <- kmeans(master_data[, c("Age", "AMT_INCOME_TOTAL")], centers = 3, nstart = 20)
master_data$Cluster <- as.factor(cluster_result$cluster)
# 2. Create Stacked Proportion Bar Chart
ggplot(master_data, aes(x = Cluster, fill = factor(Is_Bad))) +
# Use position="fill" to show proportions
geom_bar(position = "fill", alpha = 0.8) +
# Colors: Blue for Healthy, Red for Risky
scale_fill_manual(values = c("0" = "#1F77B4", "1" = "#D62728"),
name = "Applicant Risk",
labels = c("0" = "Healthy", "1" = "Risky")) +
labs(title = "Risk Proportion Across Cluster Segments",
subtitle = "Evaluating the concentration of risk in unsupervised groups",
x = "Cluster Segment",
y = "Proportion",
fill = "Applicant Risk") +
theme_minimal() +
theme(legend.position = "bottom",
legend.title = element_text(face = "bold"))## [1] "ID" "CODE_GENDER" "FLAG_OWN_CAR"
## [4] "FLAG_OWN_REALTY" "CNT_CHILDREN" "AMT_INCOME_TOTAL"
## [7] "NAME_INCOME_TYPE" "NAME_EDUCATION_TYPE" "NAME_FAMILY_STATUS"
## [10] "NAME_HOUSING_TYPE" "DAYS_BIRTH" "DAYS_EMPLOYED"
## [13] "FLAG_MOBIL" "FLAG_WORK_PHONE" "FLAG_PHONE"
## [16] "FLAG_EMAIL" "OCCUPATION_TYPE" "CNT_FAM_MEMBERS"
## [19] "Age" "Years_of_Experience" "Income_per_Member"
## [22] "Is_Bad" "Cluster"
master_data$Is_Bad <- as.factor(master_data$Is_Bad)
# 1. Filter out missing rows using only the columns present in your dataset
model_data <- na.omit(master_data[, c("Is_Bad", "Age", "AMT_INCOME_TOTAL", "Years_of_Experience")])
# 2. Train the Baseline Model (Logistic Regression) to generate `prob_default` dynamically
baseline_model <- glm(Is_Bad ~ Age + AMT_INCOME_TOTAL + Years_of_Experience,
data = model_data,
family = binomial)
# Add baseline probabilities to the dataset
model_data$prob_default <- predict(baseline_model, type = "response")
# 3. Train the Ensemble Model (Random Forest)
set.seed(42)
rf_model <- randomForest(Is_Bad ~ Age + AMT_INCOME_TOTAL + Years_of_Experience,
data = model_data,
importance = TRUE,
ntree = 100)
# Get predicted probabilities for the risky class ("1")
model_data$rf_probs <- predict(rf_model, type = "prob")[, "1"]
# 4. Calculate ROC curves
roc_log <- roc(model_data$Is_Bad, model_data$prob_default)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Calculate AUCs
auc_log <- round(auc(roc_log), 4)
auc_rf <- round(auc(roc_rf), 4)
# 5. Create data frame for plotting with explicit factor levels
df_roc <- data.frame(
Model = factor(rep(c("Baseline Model", "Ensemble Model"),
c(length(roc_log$sensitivities), length(roc_rf$sensitivities))),
levels = c("Baseline Model", "Ensemble Model")),
FPR = c(1 - roc_log$specificities, 1 - roc_rf$specificities),
TPR = c(roc_log$sensitivities, roc_rf$sensitivities)
)
# 6. Plot Comparative ROC Curves with Explicit Colors and Legend
ggplot(df_roc, aes(x = FPR, y = TPR, color = Model, linetype = Model)) +
# Lines for models
geom_line(linewidth = 1.2) +
# Diagonal reference line (random guessing, AUC = 0.5)
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray", linewidth = 1) +
# Explicit color mapping (Baseline = Blue, Ensemble = Red)
scale_color_manual(values = c("Baseline Model" = "#1F77B4", "Ensemble Model" = "#D62728")) +
# Ensure the line types are distinct
scale_linetype_manual(values = c("Baseline Model" = "solid", "Ensemble Model" = "solid")) +
# Titles and axes
labs(title = "Comparative ROC Curve Analysis",
subtitle = "Comparing Linear (Baseline) vs Non-Linear (Ensemble) discriminatory power",
x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(size = 10, color = "darkgray"),
legend.position = "bottom",
legend.title = element_text(face = "bold", size = 10),
legend.text = element_text(size = 9),
panel.grid.minor = element_blank())# Ensure target variable is a factor
master_data$Is_Bad <- as.factor(master_data$Is_Bad)
# Select only the required columns and remove missing rows
model_data <- na.omit(master_data[, c("Is_Bad", "Age", "AMT_INCOME_TOTAL", "Years_of_Experience")])
# 1. Train-Test Split (80% Train, 20% Test)
set.seed(42)
train_index <- createDataPartition(model_data$Is_Bad, p = 0.8, list = FALSE)
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]
# Display the dimensions of the datasets
cat("Training set rows: ", nrow(train_data), "\n")## Training set rows: 29166
## Testing set rows: 7291
ggplot(train_data, aes(x = factor(Is_Bad), fill = factor(Is_Bad))) +
geom_bar() +
scale_fill_manual(values = c("0" = "#1F77B4", "1" = "#D62728"),
labels = c("0" = "Healthy (Class 0)", "1" = "Default (Class 1)")) +
labs(title = "Original Training Data: Class Counts",
subtitle = "Notice the huge gap between healthy and default accounts",
x = "Target Class (Is_Bad)",
y = "Count of Applications",
fill = "Status") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 12),
legend.position = "bottom")# Create balanced dataset for visualization
class0_data <- train_data[train_data$Is_Bad == 0, ]
class1_data <- train_data[train_data$Is_Bad == 1, ]
min_class_size <- nrow(class1_data)
# Down-sample Class 0 to match Class 1's size
set.seed(42)
sampled_class0 <- class0_data[sample(nrow(class0_data), min_class_size, replace = FALSE), ]
balanced_data <- rbind(sampled_class0, class1_data)
ggplot(balanced_data, aes(x = factor(Is_Bad), fill = factor(Is_Bad))) +
geom_bar() +
scale_fill_manual(values = c("0" = "#2CA02C", "1" = "#FF7F0E"),
labels = c("0" = "Balanced Healthy (Class 0)", "1" = "Balanced Default (Class 1)")) +
labs(title = "Balanced Training Data: Class Counts",
subtitle = "Classes are now perfectly equal after applying down-sampling",
x = "Target Class (Is_Bad)",
y = "Count of Applications",
fill = "Status") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 12),
legend.position = "bottom")rf_model_balanced <- randomForest(as.factor(Is_Bad) ~ Age + AMT_INCOME_TOTAL + Years_of_Experience,
data = train_data,
ntree = 500)
# Predict on the training dataset itself
train_probs_balanced <- predict(rf_model_balanced, newdata = train_data, type = "prob")[, "1"]
train_classes_balanced <- ifelse(train_probs_balanced > 0.75, 1, 0)
# Generate Training Confusion Matrix
conf_matrix_train <- confusionMatrix(data = factor(train_classes_balanced),
reference = train_data$Is_Bad,
positive = "1")
# Display results
print(conf_matrix_train)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25732 3386
## 1 1 47
##
## Accuracy : 0.8839
## 95% CI : (0.8801, 0.8875)
## No Information Rate : 0.8823
## P-Value [Acc > NIR] : 0.2044
##
## Kappa : 0.0238
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.013691
## Specificity : 0.999961
## Pos Pred Value : 0.979167
## Neg Pred Value : 0.883715
## Prevalence : 0.117706
## Detection Rate : 0.001611
## Detection Prevalence : 0.001646
## Balanced Accuracy : 0.506826
##
## 'Positive' Class : 1
##
# Extract and display training metrics
precision_train <- conf_matrix_train$byClass['Precision']
recall_train <- conf_matrix_train$byClass['Sensitivity']
f1_score_train <- conf_matrix_train$byClass['F1']
cat("\n--- Training Set Evaluation Metrics ---\n")##
## --- Training Set Evaluation Metrics ---
## Precision : 0.9792
## Recall : 0.0137
## F1-Score : 0.027
# 1. Generate probabilities for the test dataset using the training-derived baseline model
test_data$prob_default <- predict(baseline_model, newdata = test_data, type = "response")
# 2. Predict probabilities on the unseen test dataset using the balanced model
test_probs_balanced <- predict(rf_model_balanced, newdata = test_data, type = "prob")[, "1"]
# 3. Apply the optimal 0.7 classification threshold
test_classes_balanced <- ifelse(test_probs_balanced > 0.75, 1, 0)
# 4. Generate the test set Confusion Matrix
conf_matrix_test <- confusionMatrix(data = factor(test_classes_balanced),
reference = test_data$Is_Bad,
positive = "1")
# Output the confusion matrix and statistics
print(conf_matrix_test)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6432 849
## 1 1 9
##
## Accuracy : 0.8834
## 95% CI : (0.8758, 0.8907)
## No Information Rate : 0.8823
## P-Value [Acc > NIR] : 0.3942
##
## Kappa : 0.0181
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.010490
## Specificity : 0.999845
## Pos Pred Value : 0.900000
## Neg Pred Value : 0.883395
## Prevalence : 0.117679
## Detection Rate : 0.001234
## Detection Prevalence : 0.001372
## Balanced Accuracy : 0.505167
##
## 'Positive' Class : 1
##
# 5. Extract and display final test performance metrics
precision_test <- conf_matrix_test$byClass['Precision']
recall_test <- conf_matrix_test$byClass['Sensitivity']
f1_score_test <- conf_matrix_test$byClass['F1']
cat("\n--- Final Test Set Evaluation Metrics ---\n")##
## --- Final Test Set Evaluation Metrics ---
## Precision : 0.9
## Recall : 0.0105
## F1-Score : 0.0207
# 1. Ensure the balanced model and training data are loaded
# rf_model_balanced is the trained model, train_data is the dataset
# 2. Generate probabilities and class predictions for the training set at 0.7 threshold
train_probs <- predict(rf_model_balanced, newdata = train_data, type = "prob")[, "1"]
train_classes <- ifelse(train_probs > 0.7, 1, 0)
# 3. Generate detailed classification report with 'everything' mode
full_report <- confusionMatrix(data = factor(train_classes),
reference = factor(train_data$Is_Bad),
positive = "1",
mode = "everything")
# Print the full classification report
print(full_report)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25726 3337
## 1 7 96
##
## Accuracy : 0.8853
## 95% CI : (0.8816, 0.889)
## No Information Rate : 0.8823
## P-Value [Acc > NIR] : 0.05351
##
## Kappa : 0.0478
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.027964
## Specificity : 0.999728
## Pos Pred Value : 0.932039
## Neg Pred Value : 0.885180
## Precision : 0.932039
## Recall : 0.027964
## F1 : 0.054299
## Prevalence : 0.117706
## Detection Rate : 0.003292
## Detection Prevalence : 0.003532
## Balanced Accuracy : 0.513846
##
## 'Positive' Class : 1
##
# 1. Generate default probabilities for the dataset
train_data$Default_Probability <- predict(rf_model_balanced, newdata = train_data, type = "prob")[, "1"]
# 2. Assign Risk Categories and Credit Approval Status
train_data$Risk_Category <- ifelse(train_data$Default_Probability > 0.5, "High-Risk", "Low-Risk")
train_data$Credit_Status <- ifelse(train_data$Risk_Category == "High-Risk", "Rejected", "Approved")
# 3. View the summary counts
table(train_data$Credit_Status)##
## Approved Rejected
## 28614 552
# 4. View a combined sample of both Approved and Rejected applicants
combined_applicants <- head(train_data[, c("Default_Probability", "Risk_Category", "Credit_Status")], 10)
print(combined_applicants)## Default_Probability Risk_Category Credit_Status
## 1 0.666 High-Risk Rejected
## 2 0.666 High-Risk Rejected
## 3 0.008 Low-Risk Approved
## 4 0.014 Low-Risk Approved
## 6 0.014 Low-Risk Approved
## 7 0.014 Low-Risk Approved
## 8 0.000 Low-Risk Approved
## 9 0.000 Low-Risk Approved
## 10 0.000 Low-Risk Approved
## 11 0.012 Low-Risk Approved
This project focuses on building an end-to-end classification model designed to predict credit default risk and determine whether an applicant’s credit application should be approved or rejected. The dataset originally exhibited a significant class imbalance, with 88.23% representing healthy accounts and 11.77% representing default cases. To prevent the model from becoming biased toward the majority class, down-sampling was applied during the training phase.
We utilized a Balanced Random Forest Classifier to capture non-linear relationships and identify high-risk default patterns effectively. Evaluation on the test dataset at an optimal probability threshold of 0.7 yielded an accuracy of 83.53% and a high specificity of 90.92%, demonstrating strong generalization on unseen data. The model achieves a precision of 29.21% and a sensitivity (recall) of 28.09% for the minority class, successfully identifying potential defaults while safeguarding the portfolio.
Operationally, the predictive reporting module classifies applicants into two distinct groups based on their default probability: applicants with a default probability of 0.7 or lower are categorized as “Low-Risk” and are approved for credit issuance, while those exceeding the threshold are flagged as “High-Risk” and their applications are rejected.
Ultimately, the implementation of this automated approval framework minimizes financial exposure by reducing non-performing assets (NPAs). It replaces subjective evaluation with a mathematically sound decision-making process, thereby protecting the bank’s capital, ensuring long-term profitability, and maintaining robust financial health across the portfolio.