Title: Loan Application Decision Classification

Team members

Name	Student ID
Ang Jian Wei	22060578
Chew Yong Khang	22060597
Hoo Qin Ni	22105331
Liew Yao Qin	22110206
Mah Seau Sher	22115483

=================

1. Introduction

=================

1. Project Objective

To use machine learning techniques to assist bank employees in reviewing loan eligibility by analyzing the customer details submitted through the online application form. The details provided by the customer are gender, marital status, education, income, employment experience, home ownership, requested loan amount, loan intent and credit score.

Predict loan eligibility of potential customers based on given inputs.
Predict loan amount or loan interest rate based on individual and related attributes.

2. Project Methodology

To tackle the loan approval project using both regression and classification models, we use the OSEMN framework, which stands for Obtain, Scrub, Explore, Model and iNterpret.

Obtain:
- The dataset is obtained from Kaggle: Loan Approval Classification Dataset.
Scrub:
- Handle missing values.
- Identify and handle derived features and interaction features.
- Label encoding for categorical data.
- Split into training and testing dataset.
Explore:
- Summarize the cleaned dataset.
- Visualize the data with histograms, bar charts, box plots and other plots to show the relationship between the features and the target variable.
- Compute correlation matrices to identify relationships between the features and the target variable.
Model:
This project aims to build and evaluate predictive machine learning models using a diverse set of algorithms. By exploring different approaches, we aim to identify the most effective algorithm for our specific dataset and problem domain. The algorithms under consideration include:
1. Logistic Regression
2. Decision Tree
3. Random Forest
4. Support Vector Machine
5. Gradient Boosting Machine
Interpret:
Accuracy, precision, recall, F1-score and AUC will be used to evaluate each model’s performance, and the best-performing model will be chosen as the final model for loan status prediction.
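
As an illustration of the evaluation metrics listed above, the following minimal sketch computes accuracy, precision, recall and F1-score from a pair of label vectors in base R. The function name classification_metrics and the toy vectors are assumptions made for this example and are not part of the project code. AUC is left out because it needs predicted probabilities rather than hard labels.

```r
# Hypothetical helper (not from the project code): metrics for a binary
# classifier where 1 = approved (positive class) and 0 = rejected.
classification_metrics <- function(actual, predicted, positive = 1) {
  tp <- sum(predicted == positive & actual == positive)  # true positives
  fp <- sum(predicted == positive & actual != positive)  # false positives
  fn <- sum(predicted != positive & actual == positive)  # false negatives
  tn <- sum(predicted != positive & actual != positive)  # true negatives

  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(
    accuracy  = (tp + tn) / (tp + fp + fn + tn),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall)
  )
}

# Toy labels, invented for illustration only
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted <- c(1, 0, 1, 0, 0, 1, 0, 1)
metrics <- classification_metrics(actual, predicted)
print(metrics)
```

Reporting several metrics matters here because the target is imbalanced (35,000 not approved versus 10,000 approved): a model that always predicts "Not Approved" would already reach about 78% accuracy while having zero recall for approvals.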

3. Dataset Description

The Loan Approval Classification Dataset was last updated in October 2024.
https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data/data
The dataset can be used for multiple purposes such as Exploratory Data Analysis (EDA), Classification and Regression.
This dataset contains 45000 rows and 14 columns.

Columns	Description	Type
person_age	Age of the person	Float
person_gender	Gender of the person	Categorical
person_education	Highest education level	Categorical
person_income	Annual income	Float
person_emp_exp	Years of employment experience	Integer
person_home_ownership	Home ownership status (rent, own, mortgage)	Categorical
loan_amnt	Loan amount requested	Float
loan_intent	Purpose of the loan	Categorical
loan_int_rate	Loan interest rate	Float
loan_percent_income	Loan amount as a percentage of annual income	Float
cb_person_cred_hist_length	Length of credit history in years	Float
credit_score	Credit score of the person	Integer
previous_loan_defaults_on_file	Indicator of previous loan defaults	Categorical
loan_status (target variable)	Loan approval status: 1 = approved; 0 = rejected	Integer

This dataset offers a valuable foundation for analyzing financial risk factors and conducting predictive modeling for loan approvals and credit scoring.

============================

2. Load Required Libraries

============================

For loop: iterates over each package name in the packages vector.
require(pkg, character.only = TRUE):
- Tries to load the package; returns TRUE if it is installed and FALSE (with a warning) if it is not.
- character.only = TRUE allows the package name to be passed as a string variable (pkg).
install.packages(pkg, dependencies = TRUE):
- Installs the package if require() could not load it.
- dependencies = TRUE ensures that the packages it depends on are also installed.
library(pkg, character.only = TRUE):
- Loads the newly installed package into the R session.

packages <- c(
  "tidyverse", "janitor", "caret", "rpart", "rpart.plot", 
  "randomForest", "gbm", "kknn", "fastDummies", "MLmetrics", "corrplot", 
  "PerformanceAnalytics", "kernlab", "ROSE", "pROC"
)

for (pkg in packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
    library(pkg, character.only = TRUE)
  }
}

## Loading required package: tidyverse

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: janitor
## 
## 
## Attaching package: 'janitor'
## 
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
## 
## 
## Loading required package: caret
## 
## Loading required package: lattice
## 
## 
## Attaching package: 'caret'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## 
## Loading required package: rpart
## 
## Loading required package: rpart.plot
## 
## Loading required package: randomForest
## 
## randomForest 4.7-1.2
## 
## Type rfNews() to see new features/changes/bug fixes.
## 
## 
## Attaching package: 'randomForest'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
## 
## 
## Loading required package: gbm
## 
## Loaded gbm 2.2.2
## 
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
## 
## Loading required package: kknn
## 
## 
## Attaching package: 'kknn'
## 
## 
## The following object is masked from 'package:caret':
## 
##     contr.dummy
## 
## 
## Loading required package: fastDummies
## 
## Loading required package: MLmetrics
## 
## 
## Attaching package: 'MLmetrics'
## 
## 
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## 
## 
## The following object is masked from 'package:base':
## 
##     Recall
## 
## 
## Loading required package: corrplot
## 
## corrplot 0.95 loaded
## 
## Loading required package: PerformanceAnalytics
## 
## Loading required package: xts
## 
## Loading required package: zoo
## 
## 
## Attaching package: 'zoo'
## 
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## 
## Attaching package: 'xts'
## 
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## 
## 
## Attaching package: 'PerformanceAnalytics'
## 
## 
## The following object is masked from 'package:graphics':
## 
##     legend
## 
## 
## Loading required package: kernlab
## 
## 
## Attaching package: 'kernlab'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## 
## 
## Loading required package: ROSE
## 
## Loaded ROSE 0.0-4
## 
## 
## Loading required package: pROC
## 
## Type 'citation("pROC")' for a citation.
## 
## 
## Attaching package: 'pROC'
## 
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
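
A possible variant of the loop above, sketched under the assumption that you only want to know which packages are missing before installing anything. It uses requireNamespace(), which checks availability without attaching the package; the helper name missing_packages is hypothetical.

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed.
# requireNamespace() loads a package's namespace if it can, but does not
# attach it to the search path, so the check itself masks nothing.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" and "utils" ship with R; the third name is deliberately made up.
to_install <- missing_packages(c("stats", "utils", "notARealPackage12345"))
print(to_install)

# Installation is left as an explicit, separate step:
# if (length(to_install) > 0) install.packages(to_install, dependencies = TRUE)
```

Keeping the check separate from the installation makes it easier to review what would be downloaded, at the cost of one extra step compared with the loop used in this report.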

==============================

3. Data Loading and Cleaning

==============================

— Load and Transform Data —

1. Read CSV file

Reads the CSV file from a local file path.

2. Cleaning Column Names

Standardizes column names by converting them to lowercase, removing special characters, and replacing spaces with underscores for easier reference.

3. Transform Columns

The mutate() function from dplyr is used to modify or create columns. Here it converts the categorical columns into factor variables, which R's modeling functions treat as categories rather than text. In particular, loan_status is converted into a factor with two levels:
* “Not Approved”: loan_status = 0 (rejected).
* “Approved”: loan_status = 1 (approved).

data <- read_csv("/Users/qinnihoo/Documents/R/loan_data.csv") %>%
  clean_names() %>%
  mutate(
    loan_status = factor(loan_status, labels = c("Not Approved", "Approved")),
    person_gender = as.factor(person_gender),
    person_education = as.factor(person_education),
    person_home_ownership = as.factor(person_home_ownership),
    loan_intent = as.factor(loan_intent),
    previous_loan_defaults_on_file = as.factor(previous_loan_defaults_on_file)
  )

## Rows: 45000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): person_gender, person_education, person_home_ownership, loan_intent...
## dbl (9): person_age, person_income, person_emp_exp, loan_amnt, loan_int_rate...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
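
To make the label mapping in the mutate() step concrete: factor() sorts the distinct values and assigns the labels in that order, so 0 becomes "Not Approved" and 1 becomes "Approved". The sketch below uses a small made-up vector rather than the real data. Passing levels explicitly is an optional safeguard, not something the original code does.

```r
# Made-up status codes for illustration (1 = approved, 0 = rejected)
status_codes <- c(1, 0, 0, 1, 0)

# Same call shape as in the pipeline above: labels follow the sorted levels 0, 1
status_implicit <- factor(status_codes, labels = c("Not Approved", "Approved"))

# Optional safeguard: state the levels explicitly so the mapping does not
# depend on which values happen to be present in the data
status_explicit <- factor(status_codes,
                          levels = c(0, 1),
                          labels = c("Not Approved", "Approved"))

print(status_implicit)
print(table(status_explicit))
```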

— Cleaning Data —

1. Check for Missing Values

There are no missing values in this dataset.

2. Check for Null Values

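There are no null values in this dataset. Note that is.null() is applied to each column as a whole and an existing column is never NULL, so this check necessarily reports 0; missing entries are stored as NA and are covered by the previous check.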

3. Check for Duplicate Rows

There are no duplicate rows in this dataset.

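The dataset therefore needs no cleaning for missing or duplicate records. The summary output does, however, show implausible maximum values (person_age = 144, person_emp_exp = 125), so some outliers remain in the data.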

explore_data <- function(data) {
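  cat("--- Data Overview ---")
  # glimpse() prints the column overview and invisibly returns the data,
  # so print() also shows the tibble preview that follows it in the output
  print(glimpse(data))
  print(summary(data))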
  
  # --- Check for Missing Values ---
  cat("--- Missing Value Analysis ---")
  missing_values <- data %>%
    summarise(across(everything(), ~ sum(is.na(.)))) %>%
    pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing_Count") %>%
    mutate(Total_Rows = nrow(data), Missing_Percentage = (Missing_Count / Total_Rows) * 100) %>%
    arrange(desc(Missing_Count))
  print(missing_values)
  
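  # --- Check for Null Values ---
  # Note: is.null() is evaluated once per column, and an existing column is
  # never NULL, so these counts are always 0 (missing entries are NA, see above)
  cat("--- Null Value Analysis ---")
  null_values <- sapply(data, function(x) sum(is.null(x)))
  null_values <- data.frame(Column = names(null_values), Null_Count = null_values)
  print(null_values)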
  
  # --- Check for Duplicate Rows ---
  cat("--- Duplicate Row Analysis ---")
  duplicate_rows <- data[duplicated(data), ]
  cat("Number of duplicate rows: ", nrow(duplicate_rows), "")
  
  if (nrow(duplicate_rows) > 0) {
    print(head(duplicate_rows, 5))  # Show first 5 duplicate rows
  }
}
# Run Data Exploration
explore_data(data)

## --- Data Overview ---Rows: 45,000
## Columns: 14
## $ person_age                     <dbl> 22, 21, 25, 23, 24, 21, 26, 24, 24, 21,…
## $ person_gender                  <fct> female, female, female, female, male, f…
## $ person_education               <fct> Master, High School, High School, Bache…
## $ person_income                  <dbl> 71948, 12282, 12438, 79753, 66135, 1295…
## $ person_emp_exp                 <dbl> 0, 0, 3, 0, 1, 0, 1, 5, 3, 0, 0, 0, 3, …
## $ person_home_ownership          <fct> RENT, OWN, MORTGAGE, RENT, RENT, OWN, R…
## $ loan_amnt                      <dbl> 35000, 1000, 5500, 35000, 35000, 2500, …
## $ loan_intent                    <fct> PERSONAL, EDUCATION, MEDICAL, MEDICAL, …
## $ loan_int_rate                  <dbl> 16.02, 11.14, 12.87, 15.23, 14.27, 7.14…
## $ loan_percent_income            <dbl> 0.49, 0.08, 0.44, 0.44, 0.53, 0.19, 0.3…
## $ cb_person_cred_hist_length     <dbl> 3, 2, 3, 2, 4, 2, 3, 4, 2, 3, 4, 2, 2, …
## $ credit_score                   <dbl> 561, 504, 635, 675, 586, 532, 701, 585,…
## $ previous_loan_defaults_on_file <fct> No, Yes, No, No, No, No, No, No, No, No…
## $ loan_status                    <fct> Approved, Not Approved, Approved, Appro…
## # A tibble: 45,000 × 14
##    person_age person_gender person_education person_income person_emp_exp
##         <dbl> <fct>         <fct>                    <dbl>          <dbl>
##  1         22 female        Master                   71948              0
##  2         21 female        High School              12282              0
##  3         25 female        High School              12438              3
##  4         23 female        Bachelor                 79753              0
##  5         24 male          Master                   66135              1
##  6         21 female        High School              12951              0
##  7         26 female        Bachelor                 93471              1
##  8         24 female        High School              95550              5
##  9         24 female        Associate               100684              3
## 10         21 female        High School              12739              0
## # ℹ 44,990 more rows
## # ℹ 9 more variables: person_home_ownership <fct>, loan_amnt <dbl>,
## #   loan_intent <fct>, loan_int_rate <dbl>, loan_percent_income <dbl>,
## #   cb_person_cred_hist_length <dbl>, credit_score <dbl>,
## #   previous_loan_defaults_on_file <fct>, loan_status <fct>
##    person_age     person_gender     person_education person_income    
##  Min.   : 20.00   female:20159   Associate  :12028   Min.   :   8000  
##  1st Qu.: 24.00   male  :24841   Bachelor   :13399   1st Qu.:  47204  
##  Median : 26.00                  Doctorate  :  621   Median :  67048  
##  Mean   : 27.76                  High School:11972   Mean   :  80319  
##  3rd Qu.: 30.00                  Master     : 6980   3rd Qu.:  95789  
##  Max.   :144.00                                      Max.   :7200766  
##  person_emp_exp   person_home_ownership   loan_amnt    
##  Min.   :  0.00   MORTGAGE:18489        Min.   :  500  
##  1st Qu.:  1.00   OTHER   :  117        1st Qu.: 5000  
##  Median :  4.00   OWN     : 2951        Median : 8000  
##  Mean   :  5.41   RENT    :23443        Mean   : 9583  
##  3rd Qu.:  8.00                         3rd Qu.:12237  
##  Max.   :125.00                         Max.   :35000  
##             loan_intent   loan_int_rate   loan_percent_income
##  DEBTCONSOLIDATION:7145   Min.   : 5.42   Min.   :0.0000     
##  EDUCATION        :9153   1st Qu.: 8.59   1st Qu.:0.0700     
##  HOMEIMPROVEMENT  :4783   Median :11.01   Median :0.1200     
##  MEDICAL          :8548   Mean   :11.01   Mean   :0.1397     
##  PERSONAL         :7552   3rd Qu.:12.99   3rd Qu.:0.1900     
##  VENTURE          :7819   Max.   :20.00   Max.   :0.6600     
##  cb_person_cred_hist_length  credit_score   previous_loan_defaults_on_file
##  Min.   : 2.000             Min.   :390.0   No :22142                     
##  1st Qu.: 3.000             1st Qu.:601.0   Yes:22858                     
##  Median : 4.000             Median :640.0                                 
##  Mean   : 5.867             Mean   :632.6                                 
##  3rd Qu.: 8.000             3rd Qu.:670.0                                 
##  Max.   :30.000             Max.   :850.0                                 
##        loan_status   
##  Not Approved:35000  
##  Approved    :10000  
##                      
##                      
##                      
##                      
## --- Missing Value Analysis ---# A tibble: 14 × 4
##    Column                         Missing_Count Total_Rows Missing_Percentage
##    <chr>                                  <int>      <int>              <dbl>
##  1 person_age                                 0      45000                  0
##  2 person_gender                              0      45000                  0
##  3 person_education                           0      45000                  0
##  4 person_income                              0      45000                  0
##  5 person_emp_exp                             0      45000                  0
##  6 person_home_ownership                      0      45000                  0
##  7 loan_amnt                                  0      45000                  0
##  8 loan_intent                                0      45000                  0
##  9 loan_int_rate                              0      45000                  0
## 10 loan_percent_income                        0      45000                  0
## 11 cb_person_cred_hist_length                 0      45000                  0
## 12 credit_score                               0      45000                  0
## 13 previous_loan_defaults_on_file             0      45000                  0
## 14 loan_status                                0      45000                  0
## --- Null Value Analysis ---                                                       Column Null_Count
## person_age                                         person_age          0
## person_gender                                   person_gender          0
## person_education                             person_education          0
## person_income                                   person_income          0
## person_emp_exp                                 person_emp_exp          0
## person_home_ownership                   person_home_ownership          0
## loan_amnt                                           loan_amnt          0
## loan_intent                                       loan_intent          0
## loan_int_rate                                   loan_int_rate          0
## loan_percent_income                       loan_percent_income          0
## cb_person_cred_hist_length         cb_person_cred_hist_length          0
## credit_score                                     credit_score          0
## previous_loan_defaults_on_file previous_loan_defaults_on_file          0
## loan_status                                       loan_status          0
## --- Duplicate Row Analysis ---Number of duplicate rows:  0
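
The three checks can also be reproduced compactly in base R. The sketch below runs on a small invented data frame (toy is not the loan data) and also shows why the null check reports 0 for every column: is.null() is evaluated once per column, and a column that exists is never NULL.

```r
# Invented data frame with one missing value and one duplicated row
toy <- data.frame(
  age    = c(22, 21, NA, 21),
  gender = c("female", "male", "male", "male")
)

# Missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# is.null() returns a single FALSE for each existing column,
# so summing it gives 0 regardless of the column's contents
null_counts <- sapply(toy, function(x) sum(is.null(x)))
print(null_counts)

# Rows that repeat an earlier row (row 4 repeats row 2 here)
n_duplicates <- sum(duplicated(toy))
print(n_duplicates)
```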

====================================

4. Exploratory Data Analysis (EDA)

====================================

visualize_data <- function(data) {
  # --- Distributions for All Columns ---
  cat("--- Visualizing Distributions for All Columns ---")
  
  # Identify numerical and categorical columns
  numeric_columns <- data %>% select(where(is.numeric)) %>% names()
  categorical_columns <- data %>% select(where(is.factor)) %>% names()
  
  # Visualize numerical columns
  for (col in numeric_columns) {
    print(
      ggplot(data, aes_string(x = col)) +
        geom_histogram(bins = 30, color = "black", fill = "blue", alpha = 0.7) +
        labs(
          title = paste("Distribution of", col),
          x = col,
          y = "Frequency"
        ) +
        theme_minimal()
    )
  }
  
  # Visualize categorical columns
  for (col in categorical_columns) {
    print(
      ggplot(data, aes_string(x = col)) +
        geom_bar(color = "black", fill = "orange", alpha = 0.7) +
        labs(
          title = paste("Distribution of", col),
          x = col,
          y = "Count"
        ) +
        theme_minimal()
    )
  }
  
  # --- Categorical Variables and Loan Status ---
  cat("--- Visualizing Categorical Variables with Loan Status ---")
  
  # Loan Status by Gender
  print(
    ggplot(data, aes(x = person_gender, fill = loan_status)) +
      geom_bar(position = "dodge") +
      labs(title = "Loan Status by Gender", x = "Gender", y = "Count")
  )
  cat("A higher number of loans were not approved for both males and females, but males had more approvals compared to females.")
  
  # Loan Status by Education Level
  print(
    ggplot(data, aes(x = person_education, fill = loan_status)) +
      geom_bar(position = "dodge") +
      labs(title = "Loan Status by Education Level", x = "Education Level", y = "Count")
  )
  cat("Bachelor's and Associate degree holders had the highest number of loan applications, with more loans not approved than approved across all education levels. However, Doctorate holders show relatively higher approval rates compared to other groups.")
  
  # Loan Status by Home Ownership
  print(
    ggplot(data, aes(x = person_home_ownership, fill = loan_status)) +
      geom_bar(position = "dodge") +
      labs(title = "Loan Status by Home Ownership", x = "Home Ownership", y = "Count")
  )
  cat("Applicants who rent or have mortgages make up the majority of applications, with more loans not approved than approved in these groups. Homeowners (OWN) had a smaller number of applications but relatively higher approval rates.")
  
  # Loan Status by Loan Intent
  print(
    ggplot(data, aes(x = loan_intent, fill = loan_status)) +
      geom_bar(position = "dodge") +
      labs(title = "Loan Status by Loan Intent", x = "Loan Intent", y = "Count")
  )
  cat("Education and medical loans have the highest number of applications and home improvement loans the fewest. Overall, rejections far outnumber approvals.")
  
  # Loan Status by Previous Loan Defaults
  print(
    ggplot(data, aes(x = previous_loan_defaults_on_file, fill = loan_status)) +
      geom_bar(position = "dodge") +
      labs(title = "Loan Status by Previous Loan Defaults", x = "Previous Loan Defaults", y = "Count")
  )
  cat("Applicants with previous loan defaults had far more rejections compared to approvals, while those without defaults had better approval rates.")
  
  # --- Numerical Variables and Loan Status ---
  cat("\n--- Visualizing Numerical Variables with Loan Status ---\n")
  
  # Distribution of Age by Loan Status
  print(
    ggplot(data, aes(x = person_age, fill = loan_status)) +
      geom_histogram(bins = 20, color = "black", position = "dodge") +
      labs(title = "Distribution of Age by Loan Status", x = "Age", y = "Count")
  )
  cat("Most applicants are younger, with a peak in the 20–40 age range. Across all ages, there are more loans not approved than approved.")
  
  # Distribution of Income by Loan Status
  print(
    ggplot(data, aes(x = person_income, fill = loan_status)) +
      geom_histogram(bins = 30, color = "black", position = "dodge") +
      labs(title = "Distribution of Income by Loan Status", x = "Income", y = "Count")
  )
  cat("The majority of applicants have low incomes, with very few high-income individuals. Loans are more frequently rejected at lower income levels.")
  
  # Boxplot of Income by Loan Status
  print(
    ggplot(data, aes(x = loan_status, y = person_income, fill = loan_status)) +
      geom_boxplot() +
      labs(title = "Income by Loan Status", x = "Loan Status", y = "Income")
  )
  cat("The median income for both approved and not-approved loans is similar, but there are outliers with very high incomes, especially among those not approved.")
  
  # Distribution of Loan Amount by Loan Status
  print(
    ggplot(data, aes(x = loan_amnt, fill = loan_status)) +
      geom_histogram(bins = 30, color = "black", position = "dodge") +
      labs(title = "Distribution of Loan Amount by Loan Status", x = "Loan Amount", y = "Count")
  )
  cat("Most loan amounts are concentrated in the range of 0–15,000. Loans with higher amounts are less frequent, and rejections dominate across all loan sizes.")
  
  # Boxplot of Loan Amount by Loan Status
  print(
    ggplot(data, aes(x = loan_status, y = loan_amnt, fill = loan_status)) +
      geom_boxplot() +
      labs(title = "Loan Amount by Loan Status", x = "Loan Status", y = "Loan Amount")
  )
  cat("The median loan amount is slightly higher for approved loans compared to not approved loans. However, there is significant overlap in the distributions, with a wide range of amounts for both groups.")
  
  # Distribution of Interest Rate by Loan Status
  print(
    ggplot(data, aes(x = loan_int_rate, fill = loan_status)) +
      geom_histogram(bins = 30, color = "black", position = "dodge") +
      labs(title = "Distribution of Interest Rate by Loan Status", x = "Interest Rate", y = "Count")
  )
  cat("Interest rates primarily cluster around the range of 8–12%, with more loans rejected than approved across all interest rates. The rejection rate decreases slightly as the interest rate increases.")
  
  # Boxplot of Interest Rate by Loan Status
  print(
    ggplot(data, aes(x = loan_status, y = loan_int_rate, fill = loan_status)) +
      geom_boxplot() +
      labs(title = "Interest Rate by Loan Status", x = "Loan Status", y = "Interest Rate")
  )
  cat("The median interest rate is higher for approved loans compared to rejected ones. Approved loans also show a wider range of interest rates, including higher maximum values.")
  
  # Distribution of Credit History Length by Loan Status
  print(
    ggplot(data, aes(x = cb_person_cred_hist_length, fill = loan_status)) +
      geom_histogram(bins = 20, color = "black", position = "dodge") +
      labs(title = "Distribution of Credit History Length by Loan Status", x = "Credit History Length", y = "Count")
  )
  cat("Most applicants have short credit histories (0–10 years). Rejected loans dominate across shorter credit histories, while loans with longer credit histories are more likely to be approved.\n")
  
  # Boxplot of Credit History Length by Loan Status
  print(
    ggplot(data, aes(x = loan_status, y = cb_person_cred_hist_length, fill = loan_status)) +
      geom_boxplot() +
      labs(title = "Credit History Length by Loan Status", x = "Loan Status", y = "Credit History Length")
  )

  # --- Correlation Analysis ---
  cat("--- Correlation Analysis ---")
  numeric_data <- data %>% select(where(is.numeric))
  cor_matrix <- cor(numeric_data, use = "complete.obs")
  # corrplot() draws the plot as a side effect; wrapping it in print() also
  # echoes the list it returns ($corr, $corrPos, ...) to the console
  print(
    corrplot(cor_matrix, method = "color", type = "upper", 
             title = "Correlation Matrix of Numeric Variables", 
             tl.col = "black", tl.cex = 0.8)
  )
}

# Run the Visualization
visualize_data(data)

## --- Visualizing Distributions for All Columns ---

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
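
The deprecation warning above comes from aes_string(). One non-deprecated way to map a column whose name is held in a string is the .data pronoun inside aes(). The following is a minimal sketch that assumes ggplot2 is installed; the data frame toy and the helper plot_histogram are invented for the example.

```r
library(ggplot2)

# Invented data for illustration; not the loan dataset
toy <- data.frame(
  loan_amnt    = c(1000, 5500, 35000, 2500, 8000),
  credit_score = c(504, 635, 675, 532, 701)
)

# Hypothetical helper: histogram of the column named by the string `col`
plot_histogram <- function(data, col) {
  ggplot(data, aes(x = .data[[col]])) +
    geom_histogram(bins = 30, color = "black", fill = "blue", alpha = 0.7) +
    labs(title = paste("Distribution of", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Build one plot per column; print(plots[[1]]) would draw the first one
plots <- lapply(names(toy), function(col) plot_histogram(toy, col))
```

The same substitution should apply to the bar-chart loop, since only the aesthetic mapping changes.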

## --- Visualizing Categorical Variables with Loan Status ---

## A higher number of loans were not approved for both males and females, but males had more approvals compared to females.

## Bachelor's and Associate degree holders had the highest number of loan applications, with more loans not approved than approved across all education levels. However, Doctorate holders show relatively higher approval rates compared to other groups.

## Applicants who rent or have mortgages make up the majority of applications, with more loans not approved than approved in these groups. Homeowners (OWN) had a smaller number of applications but relatively higher approval rates.

## Education and medical loans have the highest number of applications and home improvement loans the fewest. Overall, rejections far outnumber approvals.

## Applicants with previous loan defaults had far more rejections compared to approvals, while those without defaults had better approval rates.
## --- Visualizing Numerical Variables with Loan Status ---

## Most applicants are younger, with a peak in the 20–40 age range. Across all ages, there are more loans not approved than approved.

## The majority of applicants have low incomes, with very few high-income individuals. Loans are more frequently rejected at lower income levels.

## The median income for both approved and not-approved loans is similar, but there are outliers with very high incomes, especially among those not approved.

## Most loan amounts are concentrated in the range of 0–15,000. Loans with higher amounts are less frequent, and rejections dominate across all loan sizes.

## The median loan amount is slightly higher for approved loans compared to not approved loans. However, there is significant overlap in the distributions, with a wide range of amounts for both groups.

## Interest rates primarily cluster around the range of 8–12%, with more loans rejected than approved across all interest rates. The rejection rate decreases slightly as the interest rate increases.

## The median interest rate is higher for approved loans compared to rejected ones. Approved loans also show a wider range of interest rates, including higher maximum values.

## Most applicants have short credit histories (0–10 years). Rejected loans dominate across shorter credit histories, while loans with longer credit histories are more likely to be approved.

## --- Correlation Analysis ---

## $corr
##                             person_age person_income person_emp_exp   loan_amnt
## person_age                  1.00000000   0.193697781     0.95441216 0.050749541
## person_income               0.19369778   1.000000000     0.18598715 0.242290131
## person_emp_exp              0.95441216   0.185987147     1.00000000 0.044589394
## loan_amnt                   0.05074954   0.242290131     0.04458939 1.000000000
## loan_int_rate               0.01340164   0.001509828     0.01663134 0.146093082
## loan_percent_income        -0.04329864  -0.234176548    -0.03986153 0.593011449
## cb_person_cred_hist_length  0.86198456   0.124315644     0.82427154 0.042969328
## credit_score                0.17843247   0.035919225     0.18619613 0.009074282
##                            loan_int_rate loan_percent_income
## person_age                   0.013401640         -0.04329864
## person_income                0.001509828         -0.23417655
## person_emp_exp               0.016631344         -0.03986153
## loan_amnt                    0.146093082          0.59301145
## loan_int_rate                1.000000000          0.12520949
## loan_percent_income          0.125209488          1.00000000
## cb_person_cred_hist_length   0.018007997         -0.03186773
## credit_score                 0.011497752         -0.01148310
##                            cb_person_cred_hist_length credit_score
## person_age                                 0.86198456  0.178432470
## person_income                              0.12431564  0.035919225
## person_emp_exp                             0.82427154  0.186196134
## loan_amnt                                  0.04296933  0.009074282
## loan_int_rate                              0.01800800  0.011497752
## loan_percent_income                       -0.03186773 -0.011483096
## cb_person_cred_hist_length                 1.00000000  0.155204130
## credit_score                               0.15520413  1.000000000
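
The matrix above shows that person_age, person_emp_exp and cb_person_cred_hist_length are strongly correlated with one another. Below is a short sketch of how such pairs could be listed programmatically. The 0.8 cutoff and the helper name high_correlations are assumptions for illustration, and the small matrix reuses a few of the values printed above, rounded to three decimals.

```r
# Hypothetical helper: list variable pairs whose absolute correlation
# exceeds `cutoff`, looking only at the upper triangle to avoid repeats
high_correlations <- function(cor_matrix, cutoff = 0.8) {
  idx <- which(abs(cor_matrix) > cutoff & upper.tri(cor_matrix), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cor_matrix)[idx[, "row"]],
    var2 = colnames(cor_matrix)[idx[, "col"]],
    corr = cor_matrix[idx],
    stringsAsFactors = FALSE
  )
}

vars <- c("person_age", "person_emp_exp",
          "cb_person_cred_hist_length", "credit_score")
cor_subset <- matrix(
  c(1.000, 0.954, 0.862, 0.178,
    0.954, 1.000, 0.824, 0.186,
    0.862, 0.824, 1.000, 0.155,
    0.178, 0.186, 0.155, 1.000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

pairs_found <- high_correlations(cor_subset)
print(pairs_found)
```

Strongly correlated predictors carry overlapping information, which is mainly a concern for coefficient-based models such as logistic regression; tree-based models are usually less sensitive to it.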
## 
## $corrPos
##                         xName                      yName x y         corr
## 1                  person_age                 person_age 1 8  1.000000000
## 2               person_income                 person_age 2 8  0.193697781
## 3               person_income              person_income 2 7  1.000000000
## 4              person_emp_exp                 person_age 3 8  0.954412161
## 5              person_emp_exp              person_income 3 7  0.185987147
## 6              person_emp_exp             person_emp_exp 3 6  1.000000000
## 7                   loan_amnt                 person_age 4 8  0.050749541
## 8                   loan_amnt              person_income 4 7  0.242290131
## 9                   loan_amnt             person_emp_exp 4 6  0.044589394
## 10                  loan_amnt                  loan_amnt 4 5  1.000000000
## 11              loan_int_rate                 person_age 5 8  0.013401640
## 12              loan_int_rate              person_income 5 7  0.001509828
## 13              loan_int_rate             person_emp_exp 5 6  0.016631344
## 14              loan_int_rate                  loan_amnt 5 5  0.146093082
## 15              loan_int_rate              loan_int_rate 5 4  1.000000000
## 16        loan_percent_income                 person_age 6 8 -0.043298644
## 17        loan_percent_income              person_income 6 7 -0.234176548
## 18        loan_percent_income             person_emp_exp 6 6 -0.039861528
## 19        loan_percent_income                  loan_amnt 6 5  0.593011449
## 20        loan_percent_income              loan_int_rate 6 4  0.125209488
## 21        loan_percent_income        loan_percent_income 6 3  1.000000000
## 22 cb_person_cred_hist_length                 person_age 7 8  0.861984558
## 23 cb_person_cred_hist_length              person_income 7 7  0.124315644
## 24 cb_person_cred_hist_length             person_emp_exp 7 6  0.824271542
## 25 cb_person_cred_hist_length                  loan_amnt 7 5  0.042969328
## 26 cb_person_cred_hist_length              loan_int_rate 7 4  0.018007997
## 27 cb_person_cred_hist_length        loan_percent_income 7 3 -0.031867734
## 28 cb_person_cred_hist_length cb_person_cred_hist_length 7 2  1.000000000
## 29               credit_score                 person_age 8 8  0.178432470
## 30               credit_score              person_income 8 7  0.035919225
## 31               credit_score             person_emp_exp 8 6  0.186196134
## 32               credit_score                  loan_amnt 8 5  0.009074282
## 33               credit_score              loan_int_rate 8 4  0.011497752
## 34               credit_score        loan_percent_income 8 3 -0.011483096
## 35               credit_score cb_person_cred_hist_length 8 2  0.155204130
## 36               credit_score               credit_score 8 1  1.000000000
## 
## $arg
## $arg$type
## [1] "upper"

=======================

5. Feature Engineering

=======================

1. Derived Features

loan_amnt_income_ratio: Ratio of loan amount to personal income, computed only if person_income is greater than 0.
loan_interest_income_ratio: Ratio of loan interest rate to personal income, computed only if person_income is greater than 0.
age_loan_ratio: Ratio of person’s age to loan amount, computed only if loan_amnt is greater than 0.

2. Interaction Features

age_income_interaction: Interaction effect between age and income.
income_credit_interaction: Interaction effect between income and credit score.

3. Binning and Categorization

income_bin: Bins person_income into categories based on defined ranges.
age_group: Bins person_age into age categories.
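
How cut() treats a value that falls exactly on a break is easy to get wrong. The following is a minimal sketch, using made-up income values, of the left-closed intervals that right = FALSE produces with the same breaks and labels as income_bin:

```r
# Hypothetical incomes chosen to sit on or near the bin edges
incomes <- c(0, 29999, 30000, 59999.99, 90000)

income_bin <- cut(
  incomes,
  breaks = c(0, 30000, 60000, 90000, Inf),
  labels = c("Low", "Medium", "High", "Very High"),
  right = FALSE  # intervals are [lower, upper): the lower edge is included
)

print(data.frame(income = incomes, income_bin = income_bin))
```

With right = FALSE an income of exactly 30000 is Medium rather than Low. With the default right = TRUE it would be Low, and an income of 0 would become NA.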

4. Creditworthiness Indicators

good_credit: Indicates if the credit score is 650 or higher.
high_loan_ratio: Indicates if the loan represents more than 30% of income.
long_credit_history: Indicates if the credit history is longer than 10 years.

5. Transformations for Modeling

loan_status: Converts loan_status into a binary factor with levels No and Yes.
person_gender: Encodes gender as 1 for female and 0 otherwise.
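person_education: Converts education levels into numeric codes. By default factor() orders levels alphabetically, so unless the levels were set earlier the codes will not follow educational attainment; the ordering should be verified before the variable is treated as ordinal.
person_home_ownership: Encodes home ownership as numeric category codes. The categories have no natural order, so one-hot encoding may be more appropriate.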
loan_intent: Encodes loan intent into numeric categories.
previous_loan_defaults_on_file: Converts a “Yes”/“No” variable into binary numeric format.

engineer_features <- function(data) {
  data <- data %>%
    mutate(
      # Derived Features
      loan_amnt_income_ratio = ifelse(person_income > 0, loan_amnt / person_income, NA),
      loan_interest_income_ratio = ifelse(person_income > 0, loan_int_rate / person_income, NA),
      age_loan_ratio = ifelse(loan_amnt > 0, person_age / loan_amnt, NA),
      
      # Interaction Features
      age_income_interaction = person_age * person_income,
      income_credit_interaction = person_income * credit_score,
      
      # Binning and Categorization
      income_bin = cut(
        person_income, 
        breaks = c(0, 30000, 60000, 90000, Inf), 
        labels = c("Low", "Medium", "High", "Very High"),
        right = FALSE
      ),
      age_group = cut(
        person_age, 
        breaks = c(0, 25, 35, 50, Inf), 
        labels = c("Young", "Mid-Age", "Senior", "Elderly"),
        right = FALSE
      ),
      
      # Creditworthiness Indicators
      good_credit = as.numeric(credit_score >= 650),
      high_loan_ratio = as.numeric(loan_percent_income > 0.3),
      long_credit_history = as.numeric(cb_person_cred_hist_length > 10),
      
      # Transformations for Modeling
      loan_status = factor(loan_status, levels = c("Not Approved", "Approved"), labels = c("No", "Yes")),
      person_gender = as.numeric(person_gender == "female"),
      person_education = as.numeric(factor(person_education, ordered = TRUE)), # Check ordering relevance
      person_home_ownership = as.numeric(factor(person_home_ownership)), # One-hot encoding may be better
      loan_intent = as.numeric(factor(loan_intent)),
      previous_loan_defaults_on_file = as.numeric(previous_loan_defaults_on_file == "Yes")
    )
  
  return(data)
}

data <- engineer_features(data)
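
The two encoding caveats noted for person_education and person_home_ownership can be illustrated with base R. This is only a sketch: the level names and sample values below are assumptions made for illustration and should be checked against the actual columns (for example with unique(data$person_education)) before use.

```r
# Assumed level names, lowest to highest attainment (not taken from the data)
education_levels <- c("High School", "Associate", "Bachelor", "Master", "Doctorate")

education <- c("Master", "High School", "Bachelor")  # made-up sample values
education_code <- as.numeric(
  factor(education, levels = education_levels, ordered = TRUE)
)
print(education_code)  # 4 1 3

# One-hot encoding of an unordered category with base R
home <- factor(c("RENT", "OWN", "MORTGAGE", "RENT"))  # made-up sample values
home_onehot <- model.matrix(~ home - 1)  # "- 1" drops the intercept column
print(home_onehot)
```

Explicit levels make the numeric codes increase with attainment, while the one-hot matrix avoids implying any order among the home ownership categories.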

==============

6. Split Data

==============

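Split the data into training and testing sets in an 80:20 ratio. createDataPartition() samples within each class of loan_status, so both sets keep the same class proportions.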

# --- Split Data into Training and Testing Sets (80-20 Split) ---
set.seed(100)
trainIndex <- createDataPartition(data$loan_status, p = 0.8, list = FALSE)
training_set <- data[trainIndex, ]
testing_set <- data[-trainIndex, ]

# Check class distribution in training and testing sets
print("Class distribution in training set:")

## [1] "Class distribution in training set:"

print(table(training_set$loan_status))

## 
##    No   Yes 
## 28000  8000

print("Class distribution in testing set:")

## [1] "Class distribution in testing set:"

print(table(testing_set$loan_status))

## 
##   No  Yes 
## 7000 2000
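
As a quick check that the split preserved the class mix, the counts printed above can be turned into proportions. A small sketch, with the counts copied by hand from that output:

```r
# Counts copied from the table() output above
train_counts <- c(No = 28000, Yes = 8000)
test_counts  <- c(No = 7000, Yes = 2000)

train_share <- train_counts / sum(train_counts)
test_share  <- test_counts / sum(test_counts)

print(round(rbind(training = train_share, testing = test_share), 4))
```

Both rows show the same split between No and Yes, which is what a stratified partition is expected to produce.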

=====================

7. Feature Selection

=====================

1. Correlation Analysis for Numerical Variables

Identify numerical features that are highly correlated with each other (correlation > 0.8) and may cause multicollinearity issues in the model.

Features flagged by findCorrelation() as candidates for removal (each is highly correlated with at least one other feature):
* loan_amnt_income_ratio
* person_age
* income_credit_interaction
* person_income
* person_emp_exp
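
findCorrelation() returns only the names it suggests dropping, not the pairs behind them. The sketch below lists every pair whose absolute correlation exceeds the cutoff, using base R only. The helper name high_cor_pairs and the toy data are hypothetical; in the report it would be applied to the numeric columns of the training set.

```r
# List feature pairs whose absolute correlation exceeds a cutoff
high_cor_pairs <- function(df, cutoff = 0.8) {
  cm <- cor(df, use = "complete.obs")
  # Upper triangle only, so each pair appears once and the diagonal is skipped
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    feature_1 = rownames(cm)[idx[, "row"]],
    feature_2 = colnames(cm)[idx[, "col"]],
    correlation = cm[idx],
    row.names = NULL
  )
}

# Toy data: b is a noisy copy of a, c is independent
set.seed(1)
x <- rnorm(100)
toy <- data.frame(a = x, b = x + rnorm(100, sd = 0.1), c = rnorm(100))

pairs_found <- high_cor_pairs(toy, cutoff = 0.8)
print(pairs_found)
```

Seeing the pairs makes it easier to decide which member of each pair to keep, for example person_age versus person_emp_exp.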

2. Chi-Square Test for Categorical Variables

Identify significant categorical features associated with the target variable (loan_status).

Significant categorical features:
* income_bin (the binned version of person_income).
* age_group (the binned version of person_age).

3. Recursive Feature Elimination (RFE)

Use a machine learning algorithm to rank features and select the most predictive ones.

Selected features:
* previous_loan_defaults_on_file
* person_home_ownership
* loan_int_rate
* loan_amnt_income_ratio
* loan_intent
* loan_interest_income_ratio
* person_income
* income_credit_interaction
* credit_score
* loan_percent_income
* age_income_interaction
* loan_amnt
* age_loan_ratio
* person_age
* person_emp_exp

perform_feature_selection <- function(training_set) {
  
  # 1. Correlation Analysis for Numerical Variables
  cat("--- Step 1: Correlation Analysis ---")
  
  # Extract Numerical Features
  numeric_features_train <- training_set %>% select(where(is.numeric), -loan_status)
  
  # Compute Correlation Matrix
  cor_matrix <- cor(numeric_features_train, use = "complete.obs")
  
  # Identify Highly Correlated Variables (Correlation > 0.8)
  highly_correlated <- findCorrelation(cor_matrix, cutoff = 0.8, names = TRUE)
  cat("Highly Correlated Features:")
  print(highly_correlated)
  
  # Correlation Analysis Output
  correlation_analysis_output <- list(
    correlation_matrix = cor_matrix,
    highly_correlated_features = highly_correlated
  )
  
  # 2. Chi-Square Test for Categorical Variables
  cat("--- Step 2: Chi-Square Test for Categorical Variables ---")
  
  # Extract Categorical Features
  categorical_features_train <- training_set %>% select(where(is.factor), -loan_status)
  
  # Perform Chi-Square Test
  chi_square_results <- lapply(categorical_features_train, function(var) {
    chisq.test(table(training_set$loan_status, var))
  })
  
  # Collect Chi-Square Results
  chi_square_summary <- data.frame(
    Feature = names(chi_square_results),
    P_Value = sapply(chi_square_results, function(x) x$p.value)
  )
  
  # Filter Significant Features (p-value < 0.05)
  significant_categorical_vars <- chi_square_summary %>%
    filter(P_Value < 0.05) %>%
    pull(Feature)
  cat("Significant Categorical Features:")
  print(significant_categorical_vars)
  
  # Chi-Square Analysis Output
  chi_square_analysis_output <- list(
    chi_square_summary = chi_square_summary,
    significant_features = significant_categorical_vars
  )
  
  # 3. Recursive Feature Elimination (RFE)
  cat("--- Step 3: Recursive Feature Elimination (RFE) ---")
  
  # Define RFE Control
  rfe_control <- rfeControl(
    functions = rfFuncs,  # Random Forest-Based Feature Ranking
    method = "cv",        # Cross-Validation
    number = 5            # 5-Fold Cross-Validation
  )
  
  # Perform RFE
  set.seed(100)
  rfe_results <- rfe(
    x = training_set %>% select(-loan_status),  # Exclude Target Variable
    y = training_set$loan_status,
    sizes = c(5, 10, 15),                       # Evaluate Top 5, 10, 15 Features
    rfeControl = rfe_control
  )
  
  # Extract Selected Features
  rfe_selected_features <- predictors(rfe_results)
  cat("Selected Features from RFE:")
  print(rfe_selected_features)
  
  # RFE Output
  rfe_analysis_output <- list(
    rfe_results = rfe_results,
    selected_features = rfe_selected_features
  )
  
  # Combine Outputs into a Summary
  feature_selection_summary <- list(
    correlation_analysis = correlation_analysis_output,
    chi_square_analysis = chi_square_analysis_output,
    rfe_analysis = rfe_analysis_output
  )
  
  return(feature_selection_summary)
}

# Perform Feature Selection
feature_selection_results <- perform_feature_selection(training_set)

## --- Step 1: Correlation Analysis ---Highly Correlated Features:[1] "loan_amnt_income_ratio"    "person_age"               
## [3] "income_credit_interaction" "person_income"            
## [5] "person_emp_exp"           
## --- Step 2: Chi-Square Test for Categorical Variables ---Significant Categorical Features:[1] "income_bin" "age_group" 
## --- Step 3: Recursive Feature Elimination (RFE) ---Selected Features from RFE: [1] "previous_loan_defaults_on_file" "person_home_ownership"         
##  [3] "loan_int_rate"                  "loan_amnt_income_ratio"        
##  [5] "loan_intent"                    "loan_interest_income_ratio"    
##  [7] "person_income"                  "income_credit_interaction"     
##  [9] "credit_score"                   "loan_percent_income"           
## [11] "age_income_interaction"         "loan_amnt"                     
## [13] "age_loan_ratio"                 "person_age"                    
## [15] "person_emp_exp"

# Access Selected Features from RFE
selected_features <- feature_selection_results$rfe_analysis$selected_features

==============================

8. Modelling (Classification)

==============================

1. Data Preparation

Reduce Features: The training and testing sets are reduced to include only the selected features (from the feature selection step) and the target variable (loan_status).

2. Handling Class Imbalance (ROSE)

The training set has an imbalanced class distribution in loan_status: 28,000 applications are labelled No and 8,000 are labelled Yes. The ROSE (Random Over-Sampling Examples) method generates synthetic samples to balance the classes.

3. Cross-Validation Setup

A 5-fold cross-validation is used to ensure the models are evaluated on multiple data subsets.
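
The idea behind 5-fold cross-validation can be sketched in base R. This is an illustration only: the helper make_folds is hypothetical, and caret builds its own folds internally (stratified by the outcome), which this simple random assignment does not do.

```r
# Assign each of n rows to one of k folds at random, with near-equal fold sizes
make_folds <- function(n, k = 5, seed = 100) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

# 36000 is the size of the balanced training set in this report
folds <- make_folds(n = 36000, k = 5)
print(table(folds))

# Row indices used for validation and training in the first of the five rounds
validation_idx <- which(folds == 1)
training_idx   <- which(folds != 1)
```

Each row is used for validation exactly once and for training in the other four rounds, and the reported performance is the average over the five rounds.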

4. Five machine learning algorithms selected:

Logistic Regression
Decision Tree
Random Forest
Support Vector Machine (SVM)
Gradient Boosted Decision Trees (GBM)

# Reduce Training and Testing Sets to Selected Features
training_set <- training_set %>%
  select(all_of(c(selected_features, "loan_status")))
testing_set <- testing_set %>%
  select(all_of(c(selected_features, "loan_status")))

# --- Oversampling with ROSE ---
training_set_balanced <- ROSE(loan_status ~ ., data = training_set, seed = 100)$data

# Verify Class Distribution After ROSE
cat("Class distribution after applying ROSE:")

## Class distribution after applying ROSE:

print(table(training_set_balanced$loan_status))

## 
##    No   Yes 
## 18119 17881

# --- Define Control for Cross-Validation ---
control <- trainControl(
  method = "cv",
  number = 5,
  summaryFunction = defaultSummary,
  classProbs = TRUE
)

# --- Define Function to Train Models ---
train_models <- function(data, control) {
  model_list <- list()
  
  # Logistic Regression
  model_list$logistic <- train(
    loan_status ~ ., data = data, method = "glm",
    family = "binomial", trControl = control
  )
  
  # Decision Tree
  model_list$decision_tree <- train(
    loan_status ~ ., data = data, method = "rpart", 
    trControl = control
  )
  
  # Random Forest
  model_list$random_forest <- train(
    loan_status ~ ., data = data, method = "rf", 
    trControl = control
  )
  
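  # Support Vector Machine (SVM)
  # method must be set explicitly: without it, train() falls back to its
  # default ("rf") and fits a second random forest instead of an SVM.
  # "svmLinear2" is the e1071 linear-kernel SVM; its tuning parameter is cost.
  model_list$svm <- train(
    loan_status ~ ., data = data, method = "svmLinear2",
    tuneGrid = data.frame(cost = 1), trControl = control
  )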
  
  # Gradient Boosted Decision Trees (GBM)
  gbm_grid <- expand.grid(
    n.trees = c(50, 100, 150), 
    interaction.depth = c(1, 3, 5), 
    shrinkage = c(0.01, 0.1), 
    n.minobsinnode = c(5, 10)
  )
  # Pass the grid to train(); otherwise it is defined but never used
  model_list$gbm <- train(
    loan_status ~ ., data = data, method = "gbm",
    verbose = FALSE, tuneGrid = gbm_grid,
    trControl = control
  )
  
  return(model_list)
}


# Train Models on Balanced Training Set
model_list <- train_models(training_set_balanced, control)

# Define a Mapping for Model Names
model_name_mapping <- list(
  logistic = "Logistic Regression",
  decision_tree = "Decision Tree",
  random_forest = "Random Forest",
  svm = "Support Vector Machine",
  gbm = "Gradient Boosting Machine"
)

====================

9. Model Evaluation

====================

1. Generate Predictions (predict()):

The predict() function generates predicted probabilities for the “Yes” class (loan_status = Yes).

2. Generate Predicted Classes (ifelse()):

Predicted probabilities are converted into class labels (“Yes” or “No”). A threshold of 0.5 is used: if the probability is greater than 0.5, the predicted class is “Yes”; otherwise, it is “No”.

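3. Confusion Matrix (confusionMatrix()):

Compares the predicted classes (predicted_classes) with the actual classes (actuals). confusionMatrix() treats the first factor level as the positive class unless its positive argument is set; loan_status has levels No and Yes, so the Precision and Recall returned by evaluate_model() describe the “No” class.
Outputs key metrics:
* Accuracy: Proportion of correctly classified instances.
* Precision (Positive Predictive Value): Share of correctly predicted positives out of all predicted positives.
* Recall (Sensitivity): Share of correctly predicted positives out of all actual positives.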
* F1-Score: Harmonic mean of precision and recall.
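
Because precision and recall depend on which class counts as positive, it can help to compute them by hand for both choices. A minimal base R sketch with made-up labels; the helper name classification_metrics is hypothetical:

```r
# Precision, recall and F1 for a chosen positive class
classification_metrics <- function(actual, predicted, positive) {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall)
  )
}

# Made-up labels: 3 actual "Yes" and 5 actual "No"
actual    <- c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No")
predicted <- c("Yes", "Yes", "No",  "Yes", "No", "No", "No", "No")

metrics_yes <- classification_metrics(actual, predicted, positive = "Yes")
metrics_no  <- classification_metrics(actual, predicted, positive = "No")
print(round(rbind(Yes = metrics_yes, No = metrics_no), 4))
```

On the same predictions the metrics are 0.6667 with “Yes” as the positive class and 0.8 with “No”. In caret, passing positive = "Yes" to confusionMatrix() selects the class being reported.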

4. ROC and AUC (roc() and auc()):

ROC (Receiver Operating Characteristic) curve: Plots the trade-off between true positive rate (TPR) and false positive rate (FPR) for different thresholds.
AUC (Area Under the Curve): Measures the ability of the model to discriminate between classes (values range from 0 to 1, where 1 is ideal).
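
AUC also has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks in base R rather than with pROC; the helper name auc_by_rank and the scores are made up for illustration.

```r
# AUC via the rank-sum (Mann-Whitney) formulation; ties get average ranks
auc_by_rank <- function(scores, labels, positive) {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities of "Yes" and the true labels
scores <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.2)
labels <- c("Yes", "Yes", "Yes", "No", "No", "No")

auc_value <- auc_by_rank(scores, labels, positive = "Yes")
print(auc_value)  # 8 of the 9 Yes/No pairs are ordered correctly: 0.8889
```

A value of 0.5 corresponds to random ordering and 1 to perfect separation of the two classes.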

evaluate_model <- function(model, test_data) {
  # Generate Predictions
  predictions <- predict(model, test_data, type = "prob")[, "Yes"]
  
  # Predicted Classes
  predicted_classes <- ifelse(predictions > 0.5, "Yes", "No")
  actuals <- test_data$loan_status
  
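  # Confusion Matrix
  # Fix the factor levels so a model that predicts only one class still works.
  # confusionMatrix() uses the first level ("No") as the positive class by
  # default, so Pos Pred Value and Sensitivity below describe the "No" class.
  cm <- confusionMatrix(factor(predicted_classes, levels = levels(actuals)), actuals)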
  
  # ROC and AUC
  roc_obj <- roc(response = actuals, predictor = predictions, levels = rev(levels(actuals)))
  auc <- auc(roc_obj)
  
  # Return Metrics
  list(
    Accuracy = cm$overall["Accuracy"],
    Precision = cm$byClass["Pos Pred Value"],
    Recall = cm$byClass["Sensitivity"],
    F1_Score = 2 * (cm$byClass["Pos Pred Value"] * cm$byClass["Sensitivity"]) /
      (cm$byClass["Pos Pred Value"] + cm$byClass["Sensitivity"]),
    AUC = auc,
    ROC = roc_obj
  )
}

# Evaluate All Models
results <- lapply(model_list, evaluate_model, test_data = testing_set)

## Setting direction: controls > cases
## Setting direction: controls > cases
## Setting direction: controls > cases
## Setting direction: controls > cases
## Setting direction: controls > cases

==============================

10. Results and Visualization

==============================

# Summarize Results in Tabular Format
results_table <- do.call(rbind, lapply(names(results), function(name) {
  metrics <- results[[name]]
  data.frame(
    Model = name,
    Accuracy = round(metrics$Accuracy, 4),
    Precision = round(metrics$Precision, 4),
    Recall = round(metrics$Recall, 4),
    F1_Score = round(metrics$F1_Score, 4),
    AUC = round(metrics$AUC, 4)
  )
}))

# Print the Results Table
print(results_table)

##                   Model Accuracy Precision Recall F1_Score    AUC
## Accuracy       logistic   0.8448    0.9722 0.8240   0.8920 0.9490
## Accuracy1 decision_tree   0.7297    1.0000 0.6524   0.7897 0.8262
## Accuracy2 random_forest   0.7297    1.0000 0.6524   0.7897 0.8353
## Accuracy3           svm   0.7297    1.0000 0.6524   0.7897 0.8360
## Accuracy4           gbm   0.7297    1.0000 0.6524   0.7897 0.8262

# --- Function to Plot ROC Curves ---
plot_roc_curves <- function(results, model_name_mapping) {
  # Combine All ROC Curves into a Single Data Frame
  roc_data <- do.call(rbind, lapply(names(results), function(model_name) {
    roc_obj <- results[[model_name]]$ROC
    auc <- round(results[[model_name]]$AUC, 4)
    data.frame(
      FPR = 1 - roc_obj$specificities,
      TPR = roc_obj$sensitivities,
      Model = paste(model_name_mapping[[model_name]], "(AUC =", auc, ")")
    )
  }))
  
  # Plot ROC Curves
  ggplot(roc_data, aes(x = FPR, y = TPR, color = Model, group = Model)) +
    geom_line(linewidth = 1) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
    labs(
      title = "ROC Curves for All Models",
      x = "False Positive Rate (1 - Specificity)",
      y = "True Positive Rate (Sensitivity)",
      color = "Model"
    ) +
    theme_minimal() +
    theme(legend.position = "right")
}

# Plot ROC Curves
plot_roc_curves(results, model_name_mapping)

# --- Function to Plot Evaluation Metrics ---
plot_evaluation_metrics <- function(results_table) {
  # Reshape the Results Table for Visualization
  results_melted <- results_table %>%
    pivot_longer(-Model, names_to = "Metric", values_to = "Value")
  
  # Plot Evaluation Metrics
  ggplot(results_melted, aes(x = Model, y = Value, fill = Metric)) +
    geom_bar(stat = "identity", position = "dodge") +
    labs(title = "Model Evaluation Metrics", x = "Model", y = "Value") +
    theme_minimal() +
    theme(legend.position = "right")
}

# Plot Evaluation Metrics
plot_evaluation_metrics(results_table)

Logistic Regression: Has the highest AUC (0.9490), indicating the best discriminatory ability between classes, and the highest Accuracy (0.8448). Precision (0.9722) and Recall (0.8240) give an F1-Score of 0.8920. As computed by evaluate_model(), Precision and Recall treat “No” as the positive class.
Other Models (Decision Tree, Random Forest, SVM, GBM): These four models report identical Accuracy (0.7297), Precision (1.0000), Recall (0.6524), and F1-Score (0.7897), all below Logistic Regression. Their AUC values differ only slightly (0.8262 to 0.8360) and are notably lower than that of Logistic Regression. Identical thresholded metrics across four different algorithms are unusual, so these figures should be verified before the models are compared further.

=================================

11. Regression on Random Forest

=================================

# Load necessary libraries
library(randomForest)
library(dplyr)
library(caret)

# Load the dataset
data <- read.csv("/Users/qinnihoo/Documents/R/loan_data.csv") 

# Convert categorical variables to factors
data <- data %>%
  mutate(
    person_gender = as.factor(person_gender),
    person_education = as.factor(person_education),
    person_home_ownership = as.factor(person_home_ownership),
    loan_intent = as.factor(loan_intent),
    previous_loan_defaults_on_file = as.factor(previous_loan_defaults_on_file)
  )

# Split the data into training and testing sets
set.seed(123)
train_indices <- createDataPartition(data$loan_amnt, p = 0.7, list = FALSE)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]

=====================================

12. Predict Loan Amount (loan_amnt)

=====================================

Model Training: A Random Forest regression model is trained to predict loan_amnt using features like person_age, person_income, loan_int_rate, and others with 100 trees (ntree = 100).
Prediction: The model generates predictions for the loan amount on the test_data.
Evaluation Metrics:
- Mean Squared Error (MSE): Measures the average squared difference between actual and predicted loan amounts.
- Root Mean Squared Error (RMSE): The square root of MSE, providing an interpretable measure of prediction error magnitude.
Performance:
On the test set, MSE = 543,148.8 and RMSE = 736.99, so a typical prediction error is about 737 in the units of loan_amnt.
Feature Importance:
- Identifies which features contribute the most to the model’s predictions.
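- The most influential features are loan_percent_income and person_income, followed at a distance by loan_int_rate and person_home_ownership. Because loan_percent_income expresses the loan as a share of income, these two predictors together nearly determine loan_amnt, which likely explains their dominance.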
Visualization: The varImpPlot function is used to display the relative importance of features visually.
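
The two error metrics can be written as small helper functions. The sketch below uses made-up loan amounts and adds a mean-only baseline for comparison, which is an addition for illustration and not part of the original analysis.

```r
mse  <- function(actual, predicted) mean((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))

# Made-up loan amounts and predictions
actual    <- c(1000, 2000, 3000)
predicted <- c(1100, 1900, 3200)

mse_value  <- mse(actual, predicted)   # (100^2 + 100^2 + 200^2) / 3 = 20000
rmse_value <- rmse(actual, predicted)  # sqrt(20000), about 141.42
print(c(MSE = mse_value, RMSE = rmse_value))

# RMSE of a baseline that always predicts the mean, for comparison
baseline_rmse <- rmse(actual, rep(mean(actual), length(actual)))
print(baseline_rmse)
```

An RMSE is easier to judge next to such a baseline. In a real comparison the baseline mean would come from the training set rather than from the data being scored.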

# Train a Random Forest regression model for loan amount
rf_model_loan <- randomForest(loan_amnt ~ person_age + person_income + person_emp_exp +
                                loan_int_rate + loan_percent_income + cb_person_cred_hist_length +
                                credit_score + person_gender + person_education +
                                person_home_ownership + loan_intent + previous_loan_defaults_on_file,
                              data = train_data, ntree = 100)

# Predict on the test set
test_data$predicted_loan_amnt <- predict(rf_model_loan, newdata = test_data)

# Evaluate the model
mse_loan_amnt <- mean((test_data$loan_amnt - test_data$predicted_loan_amnt)^2)
rmse_loan_amnt <- sqrt(mse_loan_amnt)

cat("Loan Amount Prediction:\n")

## Loan Amount Prediction:

cat("MSE:", mse_loan_amnt, "\n")

## MSE: 543148.8

cat("RMSE:", rmse_loan_amnt, "\n")

## RMSE: 736.9863

# Feature importance for loan amount
cat("Feature Importance for Loan Amount:\n")

## Feature Importance for Loan Amount:

importance_loan <- importance(rf_model_loan)
print(importance_loan)

##                                IncNodePurity
## person_age                       10245909603
## person_income                   479310784455
## person_emp_exp                    7871545182
## loan_int_rate                    28807554105
## loan_percent_income             661699407844
## cb_person_cred_hist_length        7067338826
## credit_score                     11728805943
## person_gender                     1524511066
## person_education                  5716033598
## person_home_ownership            22402933665
## loan_intent                      10254984045
## previous_loan_defaults_on_file    5243744120

varImpPlot(rf_model_loan)

================================================

13. Predict Loan Interest Rate (loan_int_rate)

================================================

Model Evaluation:
- Mean Squared Error (MSE): Measures the average squared difference between the actual and predicted interest rates. Example: MSE = 8.11.
- Root Mean Squared Error (RMSE): The square root of MSE, providing an interpretable measure of prediction error. Example: RMSE = 2.85, meaning the typical error in predicting interest rates is approximately 2.85 percentage points.
Features contributing the most to the model include:
- person_income: Most significant predictor.
- credit_score: Second most important factor.
- loan_amnt: Also a key contributor.
- Less impactful features include person_gender, person_home_ownership, and previous_loan_defaults_on_file.

# Train a Random Forest regression model for interest rate
rf_model_int_rate <- randomForest(loan_int_rate ~ person_age + person_income + person_emp_exp +
                                    loan_amnt + loan_percent_income + cb_person_cred_hist_length +
                                    credit_score + person_gender + person_education +
                                    person_home_ownership + loan_intent + previous_loan_defaults_on_file,
                                  data = train_data, ntree = 100)

# Predict on the test set
test_data$predicted_loan_int_rate <- predict(rf_model_int_rate, newdata = test_data)

# Evaluate the model
mse_loan_int_rate <- mean((test_data$loan_int_rate - test_data$predicted_loan_int_rate)^2)
rmse_loan_int_rate <- sqrt(mse_loan_int_rate)

cat("Loan Interest Rate Prediction:\n")

## Loan Interest Rate Prediction:

cat("MSE:", mse_loan_int_rate, "\n")

## MSE: 8.109797

cat("RMSE:", rmse_loan_int_rate, "\n")

## RMSE: 2.847771

# Feature importance for interest rate
cat("Feature Importance for Loan Interest Rate:\n")

## Feature Importance for Loan Interest Rate:

importance_int_rate <- importance(rf_model_int_rate)
print(importance_int_rate)

##                                IncNodePurity
## person_age                         20845.929
## person_income                      43393.059
## person_emp_exp                     19706.331
## loan_amnt                          34760.463
## loan_percent_income                23874.598
## cb_person_cred_hist_length         17307.675
## credit_score                       37710.715
## person_gender                       5217.932
## person_education                   15582.682
## person_home_ownership               7508.895
## loan_intent                        18325.282
## previous_loan_defaults_on_file      8332.540

varImpPlot(rf_model_int_rate)
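
IncNodePurity values are on a scale that depends on the response and the sample size, so they are easier to compare as shares of the total. A small sketch, with the values copied by hand from the interest rate importance output above:

```r
# IncNodePurity values copied from the printed importance table
inc_node_purity <- c(
  person_age = 20845.929,
  person_income = 43393.059,
  person_emp_exp = 19706.331,
  loan_amnt = 34760.463,
  loan_percent_income = 23874.598,
  cb_person_cred_hist_length = 17307.675,
  credit_score = 37710.715,
  person_gender = 5217.932,
  person_education = 15582.682,
  person_home_ownership = 7508.895,
  loan_intent = 18325.282,
  previous_loan_defaults_on_file = 8332.540
)

# Share of total importance, in percent, largest first
importance_share <- sort(100 * inc_node_purity / sum(inc_node_purity), decreasing = TRUE)
print(round(importance_share, 1))
```

The ranking is unchanged by this rescaling; it only makes the relative gaps between features easier to read.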

===========================================

14. Predicting Repayment Likelihood

===========================================

Key influential features:
- previous_loan_defaults_on_file: Most impactful in predicting loan repayment likelihood (MeanDecreaseGini 2434.0).
- loan_percent_income (1850.4) and loan_int_rate (1749.6): The next most important predictors.
- person_income (1291.5): Also highly important, followed at a lower level by person_home_ownership (691.5), credit_score (598.2), and loan_amnt (578.3).
- Less impactful features include person_gender (71.8) and person_education (213.9), which contribute the least to the model.

# Convert loan_status to factor for classification
train_data$loan_status <- as.factor(train_data$loan_status)
test_data$loan_status <- as.factor(test_data$loan_status)

# Train a Random Forest classification model for loan repayment likelihood
rf_model_status <- randomForest(loan_status ~ person_age + person_income + person_emp_exp +
                                  loan_amnt + loan_int_rate + loan_percent_income +
                                  cb_person_cred_hist_length + credit_score + person_gender +
                                  person_education + person_home_ownership + loan_intent +
                                  previous_loan_defaults_on_file,
                                data = train_data, ntree = 100)

# Feature importance for loan repayment likelihood
cat("Feature Importance for Loan Repayment Likelihood:\n")

## Feature Importance for Loan Repayment Likelihood:

importance_status <- importance(rf_model_status)
print(importance_status)

##                                MeanDecreaseGini
## person_age                            315.41011
## person_income                        1291.48457
## person_emp_exp                        286.31172
## loan_amnt                             578.34962
## loan_int_rate                        1749.62648
## loan_percent_income                  1850.36828
## cb_person_cred_hist_length            261.69952
## credit_score                          598.21502
## person_gender                          71.75154
## person_education                      213.85369
## person_home_ownership                 691.47747
## loan_intent                           490.33033
## previous_loan_defaults_on_file       2434.02190

varImpPlot(rf_model_status)

===========================================

15. Risk Analysis

===========================================

# Load necessary libraries
# library(caret)
# library(randomForest)

# Load the dataset
data <- read.csv("/Users/qinnihoo/Documents/R/loan_data.csv")

# Convert categorical variables to factors
data <- data %>%
  mutate(
    person_gender = as.factor(person_gender),
    person_education = as.factor(person_education),
    person_home_ownership = as.factor(person_home_ownership),
    loan_intent = as.factor(loan_intent),
    previous_loan_defaults_on_file = as.factor(previous_loan_defaults_on_file)
  )

# Define a new dataset focusing on risk-related features
risk_data <- data %>%
  select(loan_int_rate, person_income, loan_percent_income,
         credit_score, person_age, person_emp_exp,
         previous_loan_defaults_on_file, loan_amnt)

# Split the data into training and testing sets
set.seed(123)
train_indices <- createDataPartition(risk_data$loan_int_rate, p = 0.7, list = FALSE)
train_data <- risk_data[train_indices, ]
test_data <- risk_data[-train_indices, ]

# Train a Random Forest regression model to predict loan_int_rate
set.seed(123)
rf_model_risk <- randomForest(loan_int_rate ~ person_income + loan_percent_income +
                                credit_score + person_age + person_emp_exp +
                                previous_loan_defaults_on_file + loan_amnt,
                              data = train_data, ntree = 100)

# Predict on the test set
test_data$predicted_risk <- predict(rf_model_risk, newdata = test_data)

# Evaluate the model
mse_risk <- mean((test_data$loan_int_rate - test_data$predicted_risk)^2)
rmse_risk <- sqrt(mse_risk)

cat("Risk Prediction Model - Loan Interest Rate:\n")

## Risk Prediction Model - Loan Interest Rate:

cat("MSE:", mse_risk, "\n")

## MSE: 8.264744

cat("RMSE:", rmse_risk, "\n")

## RMSE: 2.874847

# Feature importance to quantify key factors driving risk
# (stored under a new name so the importance() function is not masked)
importance_risk <- importance(rf_model_risk)
cat("Feature Importance:\n")

## Feature Importance:

print(importance_risk)

##                                IncNodePurity
## person_income                      55788.757
## loan_percent_income                27026.031
## credit_score                       47257.204
## person_age                         26147.611
## person_emp_exp                     23759.011
## previous_loan_defaults_on_file      9100.841
## loan_amnt                          39961.960

# Plot feature importance
varImpPlot(rf_model_risk)

This model identifies the factors that most influence the interest rate assigned to a loan, with person_income, credit_score, and loan_amnt playing the biggest roles. Its test RMSE of 2.87 is close to the 2.85 obtained with the full feature set in Section 13.

===============

16. Conclusion

===============

This analysis highlights the importance of factors like income, age, credit history, and loan-related ratios in determining loan approval outcomes. Class imbalance in the training data was addressed with ROSE, and the feature set was narrowed with recursive feature elimination. Among the five classifiers, Logistic Regression performed best on the test set (accuracy 0.8448, AUC 0.9490), while the Decision Tree, Random Forest, SVM, and GBM models each reached an accuracy of 0.7297. The study underscores how careful data preparation and thoughtful model selection can lead to better decision-making tools for lenders. With further refinement or the inclusion of additional data, these models could provide more reliable insights, helping financial institutions make smarter, fairer lending decisions.

WQD7004 Group 7 Project Report

2025-01-08
