| Name | Student ID |
|---|---|
| Ang Jian Wei | 22060578 |
| Chew Yong Khang | 22060597 |
| Hoo Qin Ni | 22105331 |
| Liew Yao Qin | 22110206 |
| Mah Seau Sher | 22115483 |
To use machine learning techniques assist bank employees in reviewing loan eligibility by analyzing customer data details submitted through the online application form. The details filled by customer are gender, marital status, education, income, employment experience, person employment experience, person home ownership, loan amount request, loan intent and credit score.
To tackle a loan approval classification project using both regression and classification models, we using OSEMN framework, which stands for obtain, scrub, explore, model and interpret.
Loan Approval Classification Dataset last updated in October
2024.
https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data/data
The dataset can be used for multiple purposes such as Exploratory Data
Analysis (EDA), Classification and Regression.
This dataset contains 45000 rows and 14 columns.
| Columns | Description | Type |
|---|---|---|
| person_age | Age of the person | Float |
| person_gender | Gender of the person | Categorical |
| person_education | Highest education level | Categorical |
| person_income | Annual income | Float |
| person_emp_exp | Years of employment experience | Integer |
| person_home_ownership | Home ownership status (rent, own, mortgage) | Categorical |
| loan_amnt | Loan amount requested | Float |
| loan_intent | Purpose of the loan | Categorical |
| loan_int_rate | Loan interest rate | Float |
| loan_percent_income | Loan amount as a percentage of annual income | Float |
| cb_person_cred_hist_length | Length of credit history in years | Float |
| credit_score | Credit score of the person | Integer |
| previous_loan_defaults_on_file | Indicator of previous loan defaults | Categorical |
| loan_status (target variable) | Loan approval status: 1 = approved; 0 = rejected | Integer |
This dataset offers a valuable foundation for analyzing financial risk factors and conducting predictive modeling for loan approvals and credit scoring.
For Loop: Iterates over each package name in the packages vector.
require(pkg, character.only = TRUE):
install.packages(pkg, dependencies = TRUE):
library(pkg, character.only = TRUE):
packages <- c(
"tidyverse", "janitor", "caret", "rpart", "rpart.plot",
"randomForest", "gbm", "kknn", "fastDummies", "MLmetrics", "corrplot",
"PerformanceAnalytics", "kernlab", "ROSE", "pROC"
)
for (pkg in packages) {
if (!require(pkg, character.only = TRUE)) {
install.packages(pkg, dependencies = TRUE)
library(pkg, character.only = TRUE)
}
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: janitor
##
##
## Attaching package: 'janitor'
##
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
##
##
## Loading required package: caret
##
## Loading required package: lattice
##
##
## Attaching package: 'caret'
##
##
## The following object is masked from 'package:purrr':
##
## lift
##
##
## Loading required package: rpart
##
## Loading required package: rpart.plot
##
## Loading required package: randomForest
##
## randomForest 4.7-1.2
##
## Type rfNews() to see new features/changes/bug fixes.
##
##
## Attaching package: 'randomForest'
##
##
## The following object is masked from 'package:dplyr':
##
## combine
##
##
## The following object is masked from 'package:ggplot2':
##
## margin
##
##
## Loading required package: gbm
##
## Loaded gbm 2.2.2
##
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
##
## Loading required package: kknn
##
##
## Attaching package: 'kknn'
##
##
## The following object is masked from 'package:caret':
##
## contr.dummy
##
##
## Loading required package: fastDummies
##
## Loading required package: MLmetrics
##
##
## Attaching package: 'MLmetrics'
##
##
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
##
##
## The following object is masked from 'package:base':
##
## Recall
##
##
## Loading required package: corrplot
##
## corrplot 0.95 loaded
##
## Loading required package: PerformanceAnalytics
##
## Loading required package: xts
##
## Loading required package: zoo
##
##
## Attaching package: 'zoo'
##
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
##
## Attaching package: 'xts'
##
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
##
##
## Attaching package: 'PerformanceAnalytics'
##
##
## The following object is masked from 'package:graphics':
##
## legend
##
##
## Loading required package: kernlab
##
##
## Attaching package: 'kernlab'
##
##
## The following object is masked from 'package:purrr':
##
## cross
##
##
## The following object is masked from 'package:ggplot2':
##
## alpha
##
##
## Loading required package: ROSE
##
## Loaded ROSE 0.0-4
##
##
## Loading required package: pROC
##
## Type 'citation("pROC")' for a citation.
##
##
## Attaching package: 'pROC'
##
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
Reads a CSV file located at the specific computer location.
Standardizes column names by converting them to lowercase, removing special characters, and replacing spaces with underscores for easier reference.
The mutate() function from dplyr is used to modify or create new
columns. It transforms several columns into factor variables
(categorical data), which are useful for statistical modeling or data
analysis such as converts loan_status into a factor with two
levels:
* “Not Approved”: Represents one level.
* “Approved”: Represents the other level.
data <- read_csv("/Users/qinnihoo/Documents/R/loan_data.csv") %>%
clean_names() %>%
mutate(
loan_status = factor(loan_status, labels = c("Not Approved", "Approved")),
person_gender = as.factor(person_gender),
person_education = as.factor(person_education),
person_home_ownership = as.factor(person_home_ownership),
loan_intent = as.factor(loan_intent),
previous_loan_defaults_on_file = as.factor(previous_loan_defaults_on_file)
)
## Rows: 45000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): person_gender, person_education, person_home_ownership, loan_intent...
## dbl (9): person_age, person_income, person_emp_exp, loan_amnt, loan_int_rate...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
There is no missing values in this dataset.
There is no null values in this dataset.
There is no duplicate rows in this dataset.
explore_data <- function(data) {
cat("--- Data Overview ---")
print(glimpse(data))
print(summary(data))
# --- Check for Missing Values ---
cat("--- Missing Value Analysis ---")
missing_values <- data %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing_Count") %>%
mutate(Total_Rows = nrow(data), Missing_Percentage = (Missing_Count / Total_Rows) * 100) %>%
arrange(desc(Missing_Count))
print(missing_values)
# --- Check for Null Values ---
cat("--- Null Value Analysis ---")
null_values <- sapply(data, function(x) sum(is.null(x)))
null_values <- data.frame(Column = names(null_values), Null_Count = null_values)
print(null_values)
# --- Check for Duplicate Rows ---
cat("--- Duplicate Row Analysis ---")
duplicate_rows <- data[duplicated(data), ]
cat("Number of duplicate rows: ", nrow(duplicate_rows), "")
if (nrow(duplicate_rows) > 0) {
print(head(duplicate_rows, 5)) # Show first 5 duplicate rows
}
}
# Run Data Exploration
explore_data(data)
## --- Data Overview ---Rows: 45,000
## Columns: 14
## $ person_age <dbl> 22, 21, 25, 23, 24, 21, 26, 24, 24, 21,…
## $ person_gender <fct> female, female, female, female, male, f…
## $ person_education <fct> Master, High School, High School, Bache…
## $ person_income <dbl> 71948, 12282, 12438, 79753, 66135, 1295…
## $ person_emp_exp <dbl> 0, 0, 3, 0, 1, 0, 1, 5, 3, 0, 0, 0, 3, …
## $ person_home_ownership <fct> RENT, OWN, MORTGAGE, RENT, RENT, OWN, R…
## $ loan_amnt <dbl> 35000, 1000, 5500, 35000, 35000, 2500, …
## $ loan_intent <fct> PERSONAL, EDUCATION, MEDICAL, MEDICAL, …
## $ loan_int_rate <dbl> 16.02, 11.14, 12.87, 15.23, 14.27, 7.14…
## $ loan_percent_income <dbl> 0.49, 0.08, 0.44, 0.44, 0.53, 0.19, 0.3…
## $ cb_person_cred_hist_length <dbl> 3, 2, 3, 2, 4, 2, 3, 4, 2, 3, 4, 2, 2, …
## $ credit_score <dbl> 561, 504, 635, 675, 586, 532, 701, 585,…
## $ previous_loan_defaults_on_file <fct> No, Yes, No, No, No, No, No, No, No, No…
## $ loan_status <fct> Approved, Not Approved, Approved, Appro…
## # A tibble: 45,000 × 14
## person_age person_gender person_education person_income person_emp_exp
## <dbl> <fct> <fct> <dbl> <dbl>
## 1 22 female Master 71948 0
## 2 21 female High School 12282 0
## 3 25 female High School 12438 3
## 4 23 female Bachelor 79753 0
## 5 24 male Master 66135 1
## 6 21 female High School 12951 0
## 7 26 female Bachelor 93471 1
## 8 24 female High School 95550 5
## 9 24 female Associate 100684 3
## 10 21 female High School 12739 0
## # ℹ 44,990 more rows
## # ℹ 9 more variables: person_home_ownership <fct>, loan_amnt <dbl>,
## # loan_intent <fct>, loan_int_rate <dbl>, loan_percent_income <dbl>,
## # cb_person_cred_hist_length <dbl>, credit_score <dbl>,
## # previous_loan_defaults_on_file <fct>, loan_status <fct>
## person_age person_gender person_education person_income
## Min. : 20.00 female:20159 Associate :12028 Min. : 8000
## 1st Qu.: 24.00 male :24841 Bachelor :13399 1st Qu.: 47204
## Median : 26.00 Doctorate : 621 Median : 67048
## Mean : 27.76 High School:11972 Mean : 80319
## 3rd Qu.: 30.00 Master : 6980 3rd Qu.: 95789
## Max. :144.00 Max. :7200766
## person_emp_exp person_home_ownership loan_amnt
## Min. : 0.00 MORTGAGE:18489 Min. : 500
## 1st Qu.: 1.00 OTHER : 117 1st Qu.: 5000
## Median : 4.00 OWN : 2951 Median : 8000
## Mean : 5.41 RENT :23443 Mean : 9583
## 3rd Qu.: 8.00 3rd Qu.:12237
## Max. :125.00 Max. :35000
## loan_intent loan_int_rate loan_percent_income
## DEBTCONSOLIDATION:7145 Min. : 5.42 Min. :0.0000
## EDUCATION :9153 1st Qu.: 8.59 1st Qu.:0.0700
## HOMEIMPROVEMENT :4783 Median :11.01 Median :0.1200
## MEDICAL :8548 Mean :11.01 Mean :0.1397
## PERSONAL :7552 3rd Qu.:12.99 3rd Qu.:0.1900
## VENTURE :7819 Max. :20.00 Max. :0.6600
## cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
## Min. : 2.000 Min. :390.0 No :22142
## 1st Qu.: 3.000 1st Qu.:601.0 Yes:22858
## Median : 4.000 Median :640.0
## Mean : 5.867 Mean :632.6
## 3rd Qu.: 8.000 3rd Qu.:670.0
## Max. :30.000 Max. :850.0
## loan_status
## Not Approved:35000
## Approved :10000
##
##
##
##
## --- Missing Value Analysis ---# A tibble: 14 × 4
## Column Missing_Count Total_Rows Missing_Percentage
## <chr> <int> <int> <dbl>
## 1 person_age 0 45000 0
## 2 person_gender 0 45000 0
## 3 person_education 0 45000 0
## 4 person_income 0 45000 0
## 5 person_emp_exp 0 45000 0
## 6 person_home_ownership 0 45000 0
## 7 loan_amnt 0 45000 0
## 8 loan_intent 0 45000 0
## 9 loan_int_rate 0 45000 0
## 10 loan_percent_income 0 45000 0
## 11 cb_person_cred_hist_length 0 45000 0
## 12 credit_score 0 45000 0
## 13 previous_loan_defaults_on_file 0 45000 0
## 14 loan_status 0 45000 0
## --- Null Value Analysis --- Column Null_Count
## person_age person_age 0
## person_gender person_gender 0
## person_education person_education 0
## person_income person_income 0
## person_emp_exp person_emp_exp 0
## person_home_ownership person_home_ownership 0
## loan_amnt loan_amnt 0
## loan_intent loan_intent 0
## loan_int_rate loan_int_rate 0
## loan_percent_income loan_percent_income 0
## cb_person_cred_hist_length cb_person_cred_hist_length 0
## credit_score credit_score 0
## previous_loan_defaults_on_file previous_loan_defaults_on_file 0
## loan_status loan_status 0
## --- Duplicate Row Analysis ---Number of duplicate rows: 0
visualize_data <- function(data) {
# --- Distributions for All Columns ---
cat("--- Visualizing Distributions for All Columns ---")
# Identify numerical and categorical columns
numeric_columns <- data %>% select(where(is.numeric)) %>% names()
categorical_columns <- data %>% select(where(is.factor)) %>% names()
# Visualize numerical columns
for (col in numeric_columns) {
print(
ggplot(data, aes_string(x = col)) +
geom_histogram(bins = 30, color = "black", fill = "blue", alpha = 0.7) +
labs(
title = paste("Distribution of", col),
x = col,
y = "Frequency"
) +
theme_minimal()
)
}
# Visualize categorical columns
for (col in categorical_columns) {
print(
ggplot(data, aes_string(x = col)) +
geom_bar(color = "black", fill = "orange", alpha = 0.7) +
labs(
title = paste("Distribution of", col),
x = col,
y = "Count"
) +
theme_minimal()
)
}
# --- Categorical Variables and Loan Status ---
cat("--- Visualizing Categorical Variables with Loan Status ---")
# Loan Status by Gender
print(
ggplot(data, aes(x = person_gender, fill = loan_status)) +
geom_bar(position = "dodge") +
labs(title = "Loan Status by Gender", x = "Gender", y = "Count")
)
cat("A higher number of loans were not approved for both males and females, but males had more approvals compared to females.")
# Loan Status by Education Level
print(
ggplot(data, aes(x = person_education, fill = loan_status)) +
geom_bar(position = "dodge") +
labs(title = "Loan Status by Education Level", x = "Education Level", y = "Count")
)
cat("Bachelor's and Associate degree holders had the highest number of loan applications, with more loans not approved than approved across all education levels. However, Doctorate holders show relatively higher approval rates compared to other groups.")
# Loan Status by Home Ownership
print(
ggplot(data, aes(x = person_home_ownership, fill = loan_status)) +
geom_bar(position = "dodge") +
labs(title = "Loan Status by Home Ownership", x = "Home Ownership", y = "Count")
)
cat("Applicants who rent or have mortgages make up the majority of applications, with more loans not approved than approved in these groups. Homeowners (OWN) had a smaller number of applications but relatively higher approval rates.")
# Loan Status by Loan Intent
print(
ggplot(data, aes(x = loan_intent, fill = loan_status)) +
geom_bar(position = "dodge") +
labs(title = "Loan Status by Loan Intent", x = "Loan Intent", y = "Count")
)
cat("Debt consolidation and personal loans have the highest number of applications, with significantly more rejections. Educational loans have relatively fewer applications and higher approval rates.")
# Loan Status by Previous Loan Defaults
print(
ggplot(data, aes(x = previous_loan_defaults_on_file, fill = loan_status)) +
geom_bar(position = "dodge") +
labs(title = "Loan Status by Previous Loan Defaults", x = "Previous Loan Defaults", y = "Count")
)
cat("Applicants with previous loan defaults had far more rejections compared to approvals, while those without defaults had better approval rates.")
# --- Numerical Variables and Loan Status ---
cat("\n--- Visualizing Numerical Variables with Loan Status ---\n")
# Distribution of Age by Loan Status
print(
ggplot(data, aes(x = person_age, fill = loan_status)) +
geom_histogram(bins = 20, color = "black", position = "dodge") +
labs(title = "Distribution of Age by Loan Status", x = "Age", y = "Count")
)
cat("Most applicants are younger, with a peak in the 20–40 age range. Across all ages, there are more loans not approved than approved.")
# Distribution of Income by Loan Status
print(
ggplot(data, aes(x = person_income, fill = loan_status)) +
geom_histogram(bins = 30, color = "black", position = "dodge") +
labs(title = "Distribution of Income by Loan Status", x = "Income", y = "Count")
)
cat("The majority of applicants have low incomes, with very few high-income individuals. Loans are more frequently rejected at lower income levels.")
# Boxplot of Income by Loan Status
print(
ggplot(data, aes(x = loan_status, y = person_income, fill = loan_status)) +
geom_boxplot() +
labs(title = "Income by Loan Status", x = "Loan Status", y = "Income")
)
cat("The median income for both approved and not-approved loans is similar, but there are outliers with very high incomes, especially among those not approved.")
# Distribution of Loan Amount by Loan Status
print(
ggplot(data, aes(x = loan_amnt, fill = loan_status)) +
geom_histogram(bins = 30, color = "black", position = "dodge") +
labs(title = "Distribution of Loan Amount by Loan Status", x = "Loan Amount", y = "Count")
)
cat("Most loan amounts are concentrated in the range of 0–15,000. Loans with higher amounts are less frequent, and rejections dominate across all loan sizes.")
# Boxplot of Loan Amount by Loan Status
print(
ggplot(data, aes(x = loan_status, y = loan_amnt, fill = loan_status)) +
geom_boxplot() +
labs(title = "Loan Amount by Loan Status", x = "Loan Status", y = "Loan Amount")
)
cat("The median loan amount is slightly higher for approved loans compared to not approved loans. However, there is significant overlap in the distributions, with a wide range of amounts for both groups.")
# Distribution of Interest Rate by Loan Status
print(
ggplot(data, aes(x = loan_int_rate, fill = loan_status)) +
geom_histogram(bins = 30, color = "black", position = "dodge") +
labs(title = "Distribution of Interest Rate by Loan Status", x = "Interest Rate", y = "Count")
)
cat("Interest rates primarily cluster around the range of 8–12%, with more loans rejected than approved across all interest rates. The rejection rate decreases slightly as the interest rate increases.")
# Boxplot of Interest Rate by Loan Status
print(
ggplot(data, aes(x = loan_status, y = loan_int_rate, fill = loan_status)) +
geom_boxplot() +
labs(title = "Interest Rate by Loan Status", x = "Loan Status", y = "Interest Rate")
)
cat("The median interest rate is higher for approved loans compared to rejected ones. Approved loans also show a wider range of interest rates, including higher maximum values.")
# Distribution of Credit History Length by Loan Status
print(
ggplot(data, aes(x = cb_person_cred_hist_length, fill = loan_status)) +
geom_histogram(bins = 20, color = "black", position = "dodge") +
labs(title = "Distribution of Credit History Length by Loan Status", x = "Credit History Length", y = "Count")
)
cat("Most applicants have short credit histories (0–10 years). Rejected loans dominate across shorter credit histories, while loans with longer credit histories are more likely to be approved.
")
# Boxplot of Credit History Length by Loan Status
print(
ggplot(data, aes(x = loan_status, y = cb_person_cred_hist_length, fill = loan_status)) +
geom_boxplot() +
labs(title = "Credit History Length by Loan Status", x = "Loan Status", y = "Credit History Length")
)
# --- Correlation Analysis ---
cat("--- Correlation Analysis ---")
numeric_data <- data %>% select(where(is.numeric))
cor_matrix <- cor(numeric_data, use = "complete.obs")
print(
corrplot(cor_matrix, method = "color", type = "upper",
title = "Correlation Matrix of Numeric Variables",
tl.col = "black", tl.cex = 0.8)
)
}
# Run the Visualization
visualize_data(data)
## --- Visualizing Distributions for All Columns ---
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## --- Visualizing Categorical Variables with Loan Status ---
## A higher number of loans were not approved for both males and females, but males had more approvals compared to females.
## Bachelor's and Associate degree holders had the highest number of loan applications, with more loans not approved than approved across all education levels. However, Doctorate holders show relatively higher approval rates compared to other groups.
## Applicants who rent or have mortgages make up the majority of applications, with more loans not approved than approved in these groups. Homeowners (OWN) had a smaller number of applications but relatively higher approval rates.
## Debt consolidation and personal loans have the highest number of applications, with significantly more rejections. Educational loans have relatively fewer applications and higher approval rates.
## Applicants with previous loan defaults had far more rejections compared to approvals, while those without defaults had better approval rates.
## --- Visualizing Numerical Variables with Loan Status ---
## Most applicants are younger, with a peak in the 20–40 age range. Across all ages, there are more loans not approved than approved.
## The majority of applicants have low incomes, with very few high-income individuals. Loans are more frequently rejected at lower income levels.
## The median income for both approved and not-approved loans is similar, but there are outliers with very high incomes, especially among those not approved.
## Most loan amounts are concentrated in the range of 0–15,000. Loans with higher amounts are less frequent, and rejections dominate across all loan sizes.
## The median loan amount is slightly higher for approved loans compared to not approved loans. However, there is significant overlap in the distributions, with a wide range of amounts for both groups.
## Interest rates primarily cluster around the range of 8–12%, with more loans rejected than approved across all interest rates. The rejection rate decreases slightly as the interest rate increases.
## The median interest rate is higher for approved loans compared to rejected ones. Approved loans also show a wider range of interest rates, including higher maximum values.
## Most applicants have short credit histories (0–10 years). Rejected loans dominate across shorter credit histories, while loans with longer credit histories are more likely to be approved.
## --- Correlation Analysis ---
## $corr
## person_age person_income person_emp_exp loan_amnt
## person_age 1.00000000 0.193697781 0.95441216 0.050749541
## person_income 0.19369778 1.000000000 0.18598715 0.242290131
## person_emp_exp 0.95441216 0.185987147 1.00000000 0.044589394
## loan_amnt 0.05074954 0.242290131 0.04458939 1.000000000
## loan_int_rate 0.01340164 0.001509828 0.01663134 0.146093082
## loan_percent_income -0.04329864 -0.234176548 -0.03986153 0.593011449
## cb_person_cred_hist_length 0.86198456 0.124315644 0.82427154 0.042969328
## credit_score 0.17843247 0.035919225 0.18619613 0.009074282
## loan_int_rate loan_percent_income
## person_age 0.013401640 -0.04329864
## person_income 0.001509828 -0.23417655
## person_emp_exp 0.016631344 -0.03986153
## loan_amnt 0.146093082 0.59301145
## loan_int_rate 1.000000000 0.12520949
## loan_percent_income 0.125209488 1.00000000
## cb_person_cred_hist_length 0.018007997 -0.03186773
## credit_score 0.011497752 -0.01148310
## cb_person_cred_hist_length credit_score
## person_age 0.86198456 0.178432470
## person_income 0.12431564 0.035919225
## person_emp_exp 0.82427154 0.186196134
## loan_amnt 0.04296933 0.009074282
## loan_int_rate 0.01800800 0.011497752
## loan_percent_income -0.03186773 -0.011483096
## cb_person_cred_hist_length 1.00000000 0.155204130
## credit_score 0.15520413 1.000000000
##
## $corrPos
## xName yName x y corr
## 1 person_age person_age 1 8 1.000000000
## 2 person_income person_age 2 8 0.193697781
## 3 person_income person_income 2 7 1.000000000
## 4 person_emp_exp person_age 3 8 0.954412161
## 5 person_emp_exp person_income 3 7 0.185987147
## 6 person_emp_exp person_emp_exp 3 6 1.000000000
## 7 loan_amnt person_age 4 8 0.050749541
## 8 loan_amnt person_income 4 7 0.242290131
## 9 loan_amnt person_emp_exp 4 6 0.044589394
## 10 loan_amnt loan_amnt 4 5 1.000000000
## 11 loan_int_rate person_age 5 8 0.013401640
## 12 loan_int_rate person_income 5 7 0.001509828
## 13 loan_int_rate person_emp_exp 5 6 0.016631344
## 14 loan_int_rate loan_amnt 5 5 0.146093082
## 15 loan_int_rate loan_int_rate 5 4 1.000000000
## 16 loan_percent_income person_age 6 8 -0.043298644
## 17 loan_percent_income person_income 6 7 -0.234176548
## 18 loan_percent_income person_emp_exp 6 6 -0.039861528
## 19 loan_percent_income loan_amnt 6 5 0.593011449
## 20 loan_percent_income loan_int_rate 6 4 0.125209488
## 21 loan_percent_income loan_percent_income 6 3 1.000000000
## 22 cb_person_cred_hist_length person_age 7 8 0.861984558
## 23 cb_person_cred_hist_length person_income 7 7 0.124315644
## 24 cb_person_cred_hist_length person_emp_exp 7 6 0.824271542
## 25 cb_person_cred_hist_length loan_amnt 7 5 0.042969328
## 26 cb_person_cred_hist_length loan_int_rate 7 4 0.018007997
## 27 cb_person_cred_hist_length loan_percent_income 7 3 -0.031867734
## 28 cb_person_cred_hist_length cb_person_cred_hist_length 7 2 1.000000000
## 29 credit_score person_age 8 8 0.178432470
## 30 credit_score person_income 8 7 0.035919225
## 31 credit_score person_emp_exp 8 6 0.186196134
## 32 credit_score loan_amnt 8 5 0.009074282
## 33 credit_score loan_int_rate 8 4 0.011497752
## 34 credit_score loan_percent_income 8 3 -0.011483096
## 35 credit_score cb_person_cred_hist_length 8 2 0.155204130
## 36 credit_score credit_score 8 1 1.000000000
##
## $arg
## $arg$type
## [1] "upper"
person_income is greater than 0.person_income is greater than
0.loan_amnt is greater than 0.person_income into categories based on
defined ranges.person_age into age categories.loan_status into a binary factor
with levels No and Yes.engineer_features <- function(data) {
data <- data %>%
mutate(
# Derived Features
loan_amnt_income_ratio = ifelse(person_income > 0, loan_amnt / person_income, NA),
loan_interest_income_ratio = ifelse(person_income > 0, loan_int_rate / person_income, NA),
age_loan_ratio = ifelse(loan_amnt > 0, person_age / loan_amnt, NA),
# Interaction Features
age_income_interaction = person_age * person_income,
income_credit_interaction = person_income * credit_score,
# Binning and Categorization
income_bin = cut(
person_income,
breaks = c(0, 30000, 60000, 90000, Inf),
labels = c("Low", "Medium", "High", "Very High"),
right = FALSE
),
age_group = cut(
person_age,
breaks = c(0, 25, 35, 50, Inf),
labels = c("Young", "Mid-Age", "Senior", "Elderly"),
right = FALSE
),
# Creditworthiness Indicators
good_credit = as.numeric(credit_score >= 650),
high_loan_ratio = as.numeric(loan_percent_income > 0.3),
long_credit_history = as.numeric(cb_person_cred_hist_length > 10),
# Transformations for Modeling
loan_status = factor(loan_status, levels = c("Not Approved", "Approved"), labels = c("No", "Yes")),
person_gender = as.numeric(person_gender == "female"),
person_education = as.numeric(factor(person_education, ordered = TRUE)), # Check ordering relevance
person_home_ownership = as.numeric(factor(person_home_ownership)), # One-hot encoding may be better
loan_intent = as.numeric(factor(loan_intent)),
previous_loan_defaults_on_file = as.numeric(previous_loan_defaults_on_file == "Yes")
)
return(data)
}
data <- engineer_features(data)
Split data into training and testing set to 80:20.
# --- Split Data into Training and Testing Sets (80-20 Split) ---
set.seed(100)
trainIndex <- createDataPartition(data$loan_status, p = 0.8, list = FALSE)
training_set <- data[trainIndex, ]
testing_set <- data[-trainIndex, ]
# Check class distribution in training and testing sets
print("Class distribution in training set:")
## [1] "Class distribution in training set:"
print(table(training_set$loan_status))
##
## No Yes
## 28000 8000
print("Class distribution in testing set:")
## [1] "Class distribution in testing set:"
print(table(testing_set$loan_status))
##
## No Yes
## 7000 2000
Identify numerical features that are highly correlated with each other (correlation > 0.8) and may cause multicollinearity issues in the model.
Highly correlated features:
* loan_amnt_income_ratio
* person_age
* income_credit_interaction
* person_income
* person_emp_exp
Identify significant categorical features associated with the target variable (loan_status).
Significant categorical features:
* income_bin (likely a categorical version of income levels).
* age_group (likely a categorical version of age brackets).
Use a machine learning algorithm to rank features and select the most predictive ones.
Selected features:
* previous_loan_defaults_on_file
* person_home_ownership
* loan_int_rate
* loan_amnt_income_ratio
* loan_intent
* loan_interest_income_ratio
* person_income
* income_credit_interaction
* credit_score
* loan_percent_income
* age_income_interaction
* loan_amnt
* age_loan_ratio
* person_age
* person_emp_exp
perform_feature_selection <- function(training_set) {
# 1. Correlation Analysis for Numerical Variables
cat("--- Step 1: Correlation Analysis ---")
# Extract Numerical Features
numeric_features_train <- training_set %>% select(where(is.numeric), -loan_status)
# Compute Correlation Matrix
cor_matrix <- cor(numeric_features_train, use = "complete.obs")
# Identify Highly Correlated Variables (Correlation > 0.8)
highly_correlated <- findCorrelation(cor_matrix, cutoff = 0.8, names = TRUE)
cat("Highly Correlated Features:")
print(highly_correlated)
# Correlation Analysis Output
correlation_analysis_output <- list(
correlation_matrix = cor_matrix,
highly_correlated_features = highly_correlated
)
# 2. Chi-Square Test for Categorical Variables
cat("--- Step 2: Chi-Square Test for Categorical Variables ---")
# Extract Categorical Features
categorical_features_train <- training_set %>% select(where(is.factor), -loan_status)
# Perform Chi-Square Test
chi_square_results <- lapply(categorical_features_train, function(var) {
chisq.test(table(training_set$loan_status, var))
})
# Collect Chi-Square Results
chi_square_summary <- data.frame(
Feature = names(chi_square_results),
P_Value = sapply(chi_square_results, function(x) x$p.value)
)
# Filter Significant Features (p-value < 0.05)
significant_categorical_vars <- chi_square_summary %>%
filter(P_Value < 0.05) %>%
pull(Feature)
cat("Significant Categorical Features:")
print(significant_categorical_vars)
# Chi-Square Analysis Output
chi_square_analysis_output <- list(
chi_square_summary = chi_square_summary,
significant_features = significant_categorical_vars
)
# 3. Recursive Feature Elimination (RFE)
cat("--- Step 3: Recursive Feature Elimination (RFE) ---")
# Define RFE Control
rfe_control <- rfeControl(
functions = rfFuncs, # Random Forest-Based Feature Ranking
method = "cv", # Cross-Validation
number = 5 # 5-Fold Cross-Validation
)
# Perform RFE
set.seed(100)
rfe_results <- rfe(
x = training_set %>% select(-loan_status), # Exclude Target Variable
y = training_set$loan_status,
sizes = c(5, 10, 15), # Evaluate Top 5, 10, 15 Features
rfeControl = rfe_control
)
# Extract Selected Features
rfe_selected_features <- predictors(rfe_results)
cat("Selected Features from RFE:")
print(rfe_selected_features)
# RFE Output
rfe_analysis_output <- list(
rfe_results = rfe_results,
selected_features = rfe_selected_features
)
# Combine Outputs into a Summary
feature_selection_summary <- list(
correlation_analysis = correlation_analysis_output,
chi_square_analysis = chi_square_analysis_output,
rfe_analysis = rfe_analysis_output
)
return(feature_selection_summary)
}
# Perform Feature Selection
feature_selection_results <- perform_feature_selection(training_set)
## --- Step 1: Correlation Analysis ---Highly Correlated Features:[1] "loan_amnt_income_ratio" "person_age"
## [3] "income_credit_interaction" "person_income"
## [5] "person_emp_exp"
## --- Step 2: Chi-Square Test for Categorical Variables ---Significant Categorical Features:[1] "income_bin" "age_group"
## --- Step 3: Recursive Feature Elimination (RFE) ---Selected Features from RFE: [1] "previous_loan_defaults_on_file" "person_home_ownership"
## [3] "loan_int_rate" "loan_amnt_income_ratio"
## [5] "loan_intent" "loan_interest_income_ratio"
## [7] "person_income" "income_credit_interaction"
## [9] "credit_score" "loan_percent_income"
## [11] "age_income_interaction" "loan_amnt"
## [13] "age_loan_ratio" "person_age"
## [15] "person_emp_exp"
# Access Selected Features from RFE
selected_features <- feature_selection_results$rfe_analysis$selected_features
Reduce Features: The training and testing sets are reduced to include only the selected features (from the feature selection step) and the target variable (loan_status).
The dataset likely has an imbalanced class distribution in loan_status (e.g., many more loans may be approved than denied). The ROSE (Random Over-Sampling Examples) method generates synthetic samples to balance the classes.
A 5-fold cross-validation is used to ensure the models are evaluated on multiple data subsets.
# Reduce Training and Testing Sets to Selected Features
training_set <- training_set %>%
select(all_of(c(selected_features, "loan_status")))
testing_set <- testing_set %>%
select(all_of(c(selected_features, "loan_status")))
# --- Oversampling with ROSE ---
training_set_balanced <- ROSE(loan_status ~ ., data = training_set, seed = 100)$data
# Verify Class Distribution After ROSE
cat("Class distribution after applying ROSE:")
## Class distribution after applying ROSE:
print(table(training_set_balanced$loan_status))
##
## No Yes
## 18119 17881
# --- Define Control for Cross-Validation ---
control <- trainControl(
method = "cv",
number = 5,
summaryFunction = defaultSummary,
classProbs = TRUE,
)
# --- Define Function to Train Models ---
train_models <- function(data, control) {
model_list <- list()
# Logistic Regression
model_list$logistic <- train(
loan_status ~ ., data = data, method = "glm",
family = "binomial", trControl = control
)
# Decision Tree
model_list$decision_tree <- train(
loan_status ~ ., data = data, method = "rpart",
trControl = control
)
# Random Forest
model_list$random_forest <- train(
loan_status ~ ., data = data, method = "rf",
trControl = control
)
# Support Vector Machine (SVM)
model_list$svm <- train(
loan_status ~ ., data = data, kernel = "linear",
cost = 1, trControl = control
)
# Gradient Boosted Decision Trees (GBM)
gbm_grid <- expand.grid(
n.trees = c(50, 100, 150),
interaction.depth = c(1, 3, 5),
shrinkage = c(0.01, 0.1),
n.minobsinnode = c(5, 10)
)
model_list$gbm <- train(
loan_status ~ ., data = data, method = "gbm",
verbose = FALSE,
trControl = control
)
return(model_list)
}
# Train Models on Balanced Training Set
model_list <- train_models(training_set_balanced, control)
# Define a Mapping for Model Names
model_name_mapping <- list(
logistic = "Logistic Regression",
decision_tree = "Decision Tree",
random_forest = "Random Forest",
svm = "Support Vector Machine",
gbm = "Gradient Boosting Machine",
knn = "k-Nearest Neighbors"
)
The predict() function generates predicted probabilities for the “Yes” class (loan_status = Yes).
Predicted probabilities are converted into class labels (“Yes” or “No”). A threshold of 0.5 is used: if the probability is greater than 0.5, the predicted class is “Yes”; otherwise, it is “No”.
Compares the predicted classes (predicted_classes) with the actual
classes (actuals).
Outputs key metrics:
* Accuracy: Proportion of correctly classified instances.
* Precision (Positive Predictive Value): Percentage of correctly
predicted “Yes” out of all predicted “Yes”.
* Recall (Sensitivity): Percentage of correctly predicted “Yes” out of
all actual “Yes”.
* F1-Score: Harmonic mean of precision and recall.
evaluate_model <- function(model, test_data) {
# Generate Predictions
predictions <- predict(model, test_data, type = "prob")[, "Yes"]
# Predicted Classes
predicted_classes <- ifelse(predictions > 0.5, "Yes", "No")
actuals <- test_data$loan_status
# Confusion Matrix
cm <- confusionMatrix(as.factor(predicted_classes), actuals)
# ROC and AUC
roc_obj <- roc(response = actuals, predictor = predictions, levels = rev(levels(actuals)))
auc <- auc(roc_obj)
# Return Metrics
list(
Accuracy = cm$overall["Accuracy"],
Precision = cm$byClass["Pos Pred Value"],
Recall = cm$byClass["Sensitivity"],
F1_Score = 2 * (cm$byClass["Pos Pred Value"] * cm$byClass["Sensitivity"]) /
(cm$byClass["Pos Pred Value"] + cm$byClass["Sensitivity"]),
AUC = auc,
ROC = roc_obj
)
}
# Evaluate All Models
results <- lapply(model_list, evaluate_model, test_data = testing_set)
## Setting direction: controls > cases
## Setting direction: controls > cases
## Setting direction: controls > cases
## Setting direction: controls > cases
## Setting direction: controls > cases
# Summarize Results in Tabular Format
results_table <- do.call(rbind, lapply(names(results), function(name) {
metrics <- results[[name]]
data.frame(
Model = name,
Accuracy = round(metrics$Accuracy, 4),
Precision = round(metrics$Precision, 4),
Recall = round(metrics$Recall, 4),
F1_Score = round(metrics$F1_Score, 4),
AUC = round(metrics$AUC, 4)
)
}))
# Print the Results Table
print(results_table)
## Model Accuracy Precision Recall F1_Score AUC
## Accuracy logistic 0.8448 0.9722 0.8240 0.8920 0.9490
## Accuracy1 decision_tree 0.7297 1.0000 0.6524 0.7897 0.8262
## Accuracy2 random_forest 0.7297 1.0000 0.6524 0.7897 0.8353
## Accuracy3 svm 0.7297 1.0000 0.6524 0.7897 0.8360
## Accuracy4 gbm 0.7297 1.0000 0.6524 0.7897 0.8262
# --- Function to Plot ROC Curves ---
plot_roc_curves <- function(results, model_name_mapping) {
# Combine All ROC Curves into a Single Data Frame
roc_data <- do.call(rbind, lapply(names(results), function(model_name) {
roc_obj <- results[[model_name]]$ROC
auc <- round(results[[model_name]]$AUC, 4)
data.frame(
FPR = 1 - roc_obj$specificities,
TPR = roc_obj$sensitivities,
Model = paste(model_name_mapping[[model_name]], "(AUC =", auc, ")")
)
}))
# Plot ROC Curves
ggplot(roc_data, aes(x = FPR, y = TPR, color = Model, group = Model)) +
geom_line(linewidth = 1) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
labs(
title = "ROC Curves for All Models",
x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)",
color = "Model"
) +
theme_minimal() +
theme(legend.position = "right")
}
# Plot ROC Curves
plot_roc_curves(results, model_name_mapping)
# --- Function to Plot Evaluation Metrics ---
plot_evaluation_metrics <- function(results_table) {
# Reshape the Results Table for Visualization
results_melted <- results_table %>%
pivot_longer(-Model, names_to = "Metric", values_to = "Value")
# Plot Evaluation Metrics
ggplot(results_melted, aes(x = Model, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Model Evaluation Metrics", x = "Model", y = "Value") +
theme_minimal() +
theme(legend.position = "right")
}
# Plot Evaluation Metrics
plot_evaluation_metrics(results_table)
Logistic Regression: Has the highest AUC (0.9490), indicating the best discriminatory ability between classes. Balanced Precision (0.9722) and Recall (0.8240), leading to a high F1-Score (0.8920). Accuracy is also the highest (0.8448).
Other Models (Decision Tree, Random Forest, SVM, GBM): These models all show identical results with lower metrics compared to Logistic Regression. Accuracy, Precision, Recall, F1-Score, and AUC are consistent across these models. Their AUC values are notably lower than Logistic Regression.
# Load necessary libraries
library(randomForest)
library(dplyr)
library(caret)
# Load the dataset
data <- read.csv("/Users/qinnihoo/Documents/R/loan_data.csv")
# Convert categorical variables to factors
data <- data %>%
mutate(
person_gender = as.factor(person_gender),
person_education = as.factor(person_education),
person_home_ownership = as.factor(person_home_ownership),
loan_intent = as.factor(loan_intent),
previous_loan_defaults_on_file = as.factor(previous_loan_defaults_on_file)
)
# Split the data into training and testing sets
set.seed(123)
train_indices <- createDataPartition(data$loan_amnt, p = 0.7, list = FALSE)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
Model Training: A Random Forest regression model is trained to predict loan_amnt using features like person_age, person_income, loan_int_rate, and others with 100 trees (ntree = 100).
Prediction: The model generates predictions for the loan amount on the test_data.
Evaluation Metrics:
Performance:
Example output: MSE = 543,148.8 and RMSE = 736.99, indicating the
average error in prediction.
Feature Importance:
Visualization: The varImpPlot function is used to display the relative importance of features visually.
# Train a Random Forest regression model for loan amount
rf_model_loan <- randomForest(loan_amnt ~ person_age + person_income + person_emp_exp +
loan_int_rate + loan_percent_income + cb_person_cred_hist_length +
credit_score + person_gender + person_education +
person_home_ownership + loan_intent + previous_loan_defaults_on_file,
data = train_data, ntree = 100)
# Predict on the test set
test_data$predicted_loan_amnt <- predict(rf_model_loan, newdata = test_data)
# Evaluate the model
mse_loan_amnt <- mean((test_data$loan_amnt - test_data$predicted_loan_amnt)^2)
rmse_loan_amnt <- sqrt(mse_loan_amnt)
cat("Loan Amount Prediction:\n")
## Loan Amount Prediction:
cat("MSE:", mse_loan_amnt, "\n")
## MSE: 543148.8
cat("RMSE:", rmse_loan_amnt, "\n")
## RMSE: 736.9863
# Feature importance for loan amount
cat("Feature Importance for Loan Amount:\n")
## Feature Importance for Loan Amount:
importance_loan <- importance(rf_model_loan)
print(importance_loan)
## IncNodePurity
## person_age 10245909603
## person_income 479310784455
## person_emp_exp 7871545182
## loan_int_rate 28807554105
## loan_percent_income 661699407844
## cb_person_cred_hist_length 7067338826
## credit_score 11728805943
## person_gender 1524511066
## person_education 5716033598
## person_home_ownership 22402933665
## loan_intent 10254984045
## previous_loan_defaults_on_file 5243744120
varImpPlot(rf_model_loan)
# Train a Random Forest regression model for interest rate
rf_model_int_rate <- randomForest(loan_int_rate ~ person_age + person_income + person_emp_exp +
loan_amnt + loan_percent_income + cb_person_cred_hist_length +
credit_score + person_gender + person_education +
person_home_ownership + loan_intent + previous_loan_defaults_on_file,
data = train_data, ntree = 100)
# Predict on the test set
test_data$predicted_loan_int_rate <- predict(rf_model_int_rate, newdata = test_data)
# Evaluate the model
mse_loan_int_rate <- mean((test_data$loan_int_rate - test_data$predicted_loan_int_rate)^2)
rmse_loan_int_rate <- sqrt(mse_loan_int_rate)
cat("Loan Interest Rate Prediction:\n")
## Loan Interest Rate Prediction:
cat("MSE:", mse_loan_int_rate, "\n")
## MSE: 8.109797
cat("RMSE:", rmse_loan_int_rate, "\n")
## RMSE: 2.847771
# Feature importance for interest rate
cat("Feature Importance for Loan Interest Rate:\n")
## Feature Importance for Loan Interest Rate:
importance_int_rate <- importance(rf_model_int_rate)
print(importance_int_rate)
## IncNodePurity
## person_age 20845.929
## person_income 43393.059
## person_emp_exp 19706.331
## loan_amnt 34760.463
## loan_percent_income 23874.598
## cb_person_cred_hist_length 17307.675
## credit_score 37710.715
## person_gender 5217.932
## person_education 15582.682
## person_home_ownership 7508.895
## loan_intent 18325.282
## previous_loan_defaults_on_file 8332.540
varImpPlot(rf_model_int_rate)
# Convert loan_status to factor for classification
train_data$loan_status <- as.factor(train_data$loan_status)
test_data$loan_status <- as.factor(test_data$loan_status)
# Train a Random Forest classification model for loan repayment likelihood
rf_model_status <- randomForest(loan_status ~ person_age + person_income + person_emp_exp +
loan_amnt + loan_int_rate + loan_percent_income +
cb_person_cred_hist_length + credit_score + person_gender +
person_education + person_home_ownership + loan_intent +
previous_loan_defaults_on_file,
data = train_data, ntree = 100)
# Feature importance for loan repayment likelihood
cat("Feature Importance for Loan Repayment Likelihood:\n")
## Feature Importance for Loan Repayment Likelihood:
importance_status <- importance(rf_model_status)
print(importance_status)
## MeanDecreaseGini
## person_age 315.41011
## person_income 1291.48457
## person_emp_exp 286.31172
## loan_amnt 578.34962
## loan_int_rate 1749.62648
## loan_percent_income 1850.36828
## cb_person_cred_hist_length 261.69952
## credit_score 598.21502
## person_gender 71.75154
## person_education 213.85369
## person_home_ownership 691.47747
## loan_intent 490.33033
## previous_loan_defaults_on_file 2434.02190
varImpPlot(rf_model_status)
# Load necessary libraries
# library(caret)
# library(randomForest)
# Load the dataset
data <- read.csv("/Users/qinnihoo/Documents/R/loan_data.csv")
# Convert categorical variables to factors
data <- data %>%
mutate(
person_gender = as.factor(person_gender),
person_education = as.factor(person_education),
person_home_ownership = as.factor(person_home_ownership),
loan_intent = as.factor(loan_intent),
previous_loan_defaults_on_file = as.factor(previous_loan_defaults_on_file)
)
# Define a new dataset focusing on risk-related features
risk_data <- data %>%
select(loan_int_rate, person_income, loan_percent_income,
credit_score, person_age, person_emp_exp,
previous_loan_defaults_on_file, loan_amnt)
# Split the data into training and testing sets
set.seed(123)
train_indices <- createDataPartition(risk_data$loan_int_rate, p = 0.7, list = FALSE)
train_data <- risk_data[train_indices, ]
test_data <- risk_data[-train_indices, ]
# Train a Random Forest regression model to predict loan_int_rate
set.seed(123)
rf_model_risk <- randomForest(loan_int_rate ~ person_income + loan_percent_income +
credit_score + person_age + person_emp_exp +
previous_loan_defaults_on_file + loan_amnt,
data = train_data, ntree = 100)
# Predict on the test set
test_data$predicted_risk <- predict(rf_model_risk, newdata = test_data)
# Evaluate the model
mse_risk <- mean((test_data$loan_int_rate - test_data$predicted_risk)^2)
rmse_risk <- sqrt(mse_risk)
cat("Risk Prediction Model - Loan Interest Rate:\n")
## Risk Prediction Model - Loan Interest Rate:
cat("MSE:", mse_risk, "\n")
## MSE: 8.264744
cat("RMSE:", rmse_risk, "\n")
## RMSE: 2.874847
# Feature importance to quantify key factors driving risk
importance <- importance(rf_model_risk)
cat("Feature Importance:\n")
## Feature Importance:
print(importance)
## IncNodePurity
## person_income 55788.757
## loan_percent_income 27026.031
## credit_score 47257.204
## person_age 26147.611
## person_emp_exp 23759.011
## previous_loan_defaults_on_file 9100.841
## loan_amnt 39961.960
# Plot feature importance
varImpPlot(rf_model_risk)
This model identifies the factors that most influence the interest rate assigned to a loan, with person_income, credit_score, and loan_amount playing the biggest roles.
This analysis highlights the importance of factors like income, age, credit history, and loan-related ratios in determining loan approval outcomes. By addressing data imbalances and focusing on meaningful features, the models became more reliable and accurate. Gradient Boosting and Random Forest performed the best, showing strong predictive capabilities. The study underscores how careful data preparation and thoughtful model selection can lead to better decision-making tools for lenders. With further refinement or the inclusion of additional data, these models could provide even more reliable insights, helping financial institutions make smarter, fairer lending decisions.