Predictive Modelling for Loan Approval (Group 13)

Name	Matric No
Wan Nurul Adibah Binti Wan Tarmizi	24062857
Nur Anis Insyirah Binti Jowanis	24073722
Nur Izzati Binti Juhari	24075056
Puteri Safa Balqis Binti Megat Sharizal Amri	23114952
Yuhan Silvian	24077249

Introduction

Financial risk assessment is crucial in loan approval decisions, enabling banks and credit institutions to mitigate defaults while ensuring credit access for qualified applicants. In an era where credit accessibility directly influences economic growth, the ability to accurately assess borrower risk has become both a strategic and regulatory necessity.

Traditional methods rely on manual evaluation of applicant profiles such as income, credit history, and employment type, but these are often time-consuming and subject to human bias (Debabrata Dansana et al., 2023; Wu, 2024).

Recent advances in machine learning, particularly Logistic Regression, have shown high potential for predicting default risk by analyzing relationships between multiple variables (N. Penchalaiah, 2022). However, model outcomes may still be influenced by organizational policies and subjective decision-making in borderline cases (Jansson et al., 2023).

This project explores how machine learning can be used to increase accuracy and promote fairness in lending decisions by reducing reliance on subjective judgment.

Problem Statement

Loan approval procedures at financial institutions are still beset by structural inefficiencies and outdated risk assessment techniques that fall short of addressing today’s lending challenges. Although widely used, traditional credit scoring models are seriously limited in their capacity to assess non-traditional borrowers: they may wrongly reject qualified borrowers while approving high-risk candidates, resulting in financial losses across loan portfolios each year.

Thus, there is an urgent need for a machine learning framework that maintains regulatory compliance and operational scalability for lenders while extracting useful insights from multidimensional applicant data.

Objectives

RO1: To preprocess the dataset and uncover patterns, trends, and relationships within the data, which will provide valuable insights and aid subsequent modeling or decision-making processes

RO2: To develop a binary classification model (approved/not approved) and regression model (RiskScore) using machine learning that integrates dynamic feature weighting to achieve more than 80% accuracy

RO3: To identify the key variables that influence the likelihood of an approved/not approved decision

RO4: To evaluate and compare machine learning models in order to find the best-performing model

Research Questions

RQ1: How can a binary classification model be developed to accurately predict loan approval decisions?

RQ2: How can a regression model be constructed to predict the risk score of loan applicants?

RQ3: What are the most significant factors influencing loan approval outcomes?

RQ4: Which machine learning model demonstrates the highest performance in predicting loan approval and assessing default risk?
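As a rough illustration of how the accuracy target in RO2 could be checked for the classification task in RQ1, the sketch below fits a logistic regression with base R's glm() on simulated data. The data frame `toy`, its simulated values, the 80/20 split and the 0.5 cut-off are all assumptions made for illustration; they are not taken from Loan.csv and this is not the project's final model.

```r
# Minimal sketch on simulated data; the column names mimic the dataset,
# but the values and the 0.5 cut-off are illustrative assumptions.
set.seed(13)
n <- 1000
toy <- data.frame(
  CreditScore  = round(rnorm(n, mean = 570, sd = 50)),
  AnnualIncome = round(rlnorm(n, meanlog = 10.8, sdlog = 0.5))
)

# Simulate approvals that become more likely with higher score and income
linear_part <- -20 + 0.02 * toy$CreditScore + 0.8 * log(toy$AnnualIncome)
toy$LoanApproved <- rbinom(n, size = 1, prob = plogis(linear_part))

# Hold out 20% of the rows for testing
test_idx <- sample(n, size = 0.2 * n)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Fit a logistic regression (binomial GLM) on the training rows
fit <- glm(LoanApproved ~ CreditScore + log(AnnualIncome),
           data = train, family = binomial)

# Predicted approval probabilities, converted to classes at 0.5
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

# Accuracy: share of held-out rows classified correctly
accuracy <- mean(pred == test$LoanApproved)
accuracy
```

Measuring accuracy on held-out rows rather than on the training rows keeps the estimate from being overly optimistic.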

Chosen Dataset

Name: Financial Risk for Loan Approval

Description: This is a synthetic dataset containing 20,000 records of individual loan applicants, covering both personal and financial attributes

Link: https://www.kaggle.com/datasets/lorenzozoppelletto/financial-risk-for-loan-approval

Import Dataset

Install packages here:

options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("caret")

## Installing package into 'C:/Users/puter/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)

## package 'caret' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\puter\AppData\Local\Temp\RtmpEr2lRG\downloaded_packages
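Calling install.packages() unconditionally reinstalls the packages every time the report is knitted. One possible alternative, sketched below, is to find out which packages are missing and install only those. The helper name `missing_packages` is hypothetical, and the install call is left commented out so the sketch has no side effects beyond loading namespaces.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed.
# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this report
needed <- c("caret", "tidyverse", "skimr", "corrplot", "knitr")
to_install <- missing_packages(needed)
to_install

# Install only what is absent (uncomment to run)
# if (length(to_install) > 0) install.packages(to_install)
```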

data <- read.csv("Loan.csv")
head(data)

##   ApplicationDate Age AnnualIncome CreditScore EmploymentStatus EducationLevel
## 1      2018-01-01  45        39948         617         Employed         Master
## 2      2018-01-02  38        39709         628         Employed      Associate
## 3      2018-01-03  47        40724         570         Employed       Bachelor
## 4      2018-01-04  58        69084         545         Employed    High School
## 5      2018-01-05  37       103264         594         Employed      Associate
## 6      2018-01-06  37       178310         626    Self-Employed         Master
##   Experience LoanAmount LoanDuration MaritalStatus NumberOfDependents
## 1         22      13152           48       Married                  2
## 2         15      26045           48        Single                  1
## 3         26      17627           36       Married                  2
## 4         34      37898           96        Single                  1
## 5         17       9184           36       Married                  1
## 6         16      15433           72       Married                  0
##   HomeOwnershipStatus MonthlyDebtPayments CreditCardUtilizationRate
## 1                 Own                 183                0.35441792
## 2            Mortgage                 496                0.08782697
## 3                Rent                 902                0.13741410
## 4            Mortgage                 755                0.26758714
## 5            Mortgage                 274                0.32053532
## 6                Rent                 732                0.10221134
##   NumberOfOpenCreditLines NumberOfCreditInquiries DebtToIncomeRatio
## 1                       1                       2        0.35833560
## 2                       5                       3        0.33027367
## 3                       2                       0        0.24472911
## 4                       2                       1        0.43624427
## 5                       0                       0        0.07888421
## 6                       5                       1        0.25936640
##   BankruptcyHistory        LoanPurpose PreviousLoanDefaults PaymentHistory
## 1                 0               Home                    0             29
## 2                 0 Debt Consolidation                    0             21
## 3                 0          Education                    0             20
## 4                 0               Home                    0             27
## 5                 0 Debt Consolidation                    0             26
## 6                 0 Debt Consolidation                    1             16
##   LengthOfCreditHistory SavingsAccountBalance CheckingAccountBalance
## 1                     9                  7632                   1202
## 2                     9                  4627                   3460
## 3                    22                   886                    895
## 4                    10                  1675                   1217
## 5                    27                  1555                   4981
## 6                    19                  2118                   1223
##   TotalAssets TotalLiabilities MonthlyIncome UtilityBillsPaymentHistory
## 1      146111            19183      3329.000                  0.7249720
## 2       53204             9595      3309.083                  0.9351321
## 3       25176           128874      3393.667                  0.8722406
## 4      104822             5370      5757.000                  0.8961547
## 5      244305            17286      8605.333                  0.9413687
## 6       67914            40843     14859.167                  0.7560794
##   JobTenure NetWorth BaseInterestRate InterestRate MonthlyLoanPayment
## 1        11   126928         0.199652    0.2275896           419.8060
## 2         3    43609         0.207045    0.2010771           794.0542
## 3         6     5205         0.217627    0.2125480           666.4067
## 4         5    99452         0.300398    0.3009108          1047.5070
## 5         5   227019         0.197184    0.1759902           330.1791
## 6         5    27071         0.217433    0.2176012           385.5771
##   TotalDebtToIncomeRatio LoanApproved RiskScore
## 1             0.18107720            0        49
## 2             0.38985245            0        52
## 3             0.46215697            0        52
## 4             0.31309831            0        54
## 5             0.07020985            1        36
## 6             0.07521129            1        44

Load Libraries:

install.packages(c("tidyverse","ggplot2", "dplyr", "skimr", "corrplot", "knitr"))

## Installing packages into 'C:/Users/puter/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)

## package 'tidyverse' successfully unpacked and MD5 sums checked
## package 'ggplot2' successfully unpacked and MD5 sums checked
## package 'dplyr' successfully unpacked and MD5 sums checked
## package 'skimr' successfully unpacked and MD5 sums checked
## package 'corrplot' successfully unpacked and MD5 sums checked
## package 'knitr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\puter\AppData\Local\Temp\RtmpEr2lRG\downloaded_packages

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(skimr)
library(corrplot)

## corrplot 0.95 loaded

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.4     ✔ tidyr     1.3.1
## ✔ readr     2.1.5

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(knitr)

Data Overview

1. Structure of data
2. Summary of data
3. Data quality check

# Structure of the dataset
str(data)

## 'data.frame':    20000 obs. of  36 variables:
##  $ ApplicationDate           : chr  "2018-01-01" "2018-01-02" "2018-01-03" "2018-01-04" ...
##  $ Age                       : int  45 38 47 58 37 37 58 49 34 46 ...
##  $ AnnualIncome              : int  39948 39709 40724 69084 103264 178310 51250 97345 116841 40615 ...
##  $ CreditScore               : int  617 628 570 545 594 626 564 516 603 612 ...
##  $ EmploymentStatus          : chr  "Employed" "Employed" "Employed" "Employed" ...
##  $ EducationLevel            : chr  "Master" "Associate" "Bachelor" "High School" ...
##  $ Experience                : int  22 15 26 34 17 16 39 23 12 19 ...
##  $ LoanAmount                : int  13152 26045 17627 37898 9184 15433 12741 19634 55353 25443 ...
##  $ LoanDuration              : int  48 48 36 96 36 72 48 12 60 12 ...
##  $ MaritalStatus             : chr  "Married" "Single" "Married" "Single" ...
##  $ NumberOfDependents        : int  2 1 2 1 1 0 0 5 5 4 ...
##  $ HomeOwnershipStatus       : chr  "Own" "Mortgage" "Rent" "Mortgage" ...
##  $ MonthlyDebtPayments       : int  183 496 902 755 274 732 337 288 638 704 ...
##  $ CreditCardUtilizationRate : num  0.3544 0.0878 0.1374 0.2676 0.3205 ...
##  $ NumberOfOpenCreditLines   : int  1 5 2 2 0 5 6 5 3 3 ...
##  $ NumberOfCreditInquiries   : int  2 3 0 1 0 1 1 0 0 2 ...
##  $ DebtToIncomeRatio         : num  0.3583 0.3303 0.2447 0.4362 0.0789 ...
##  $ BankruptcyHistory         : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ LoanPurpose               : chr  "Home" "Debt Consolidation" "Education" "Home" ...
##  $ PreviousLoanDefaults      : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ PaymentHistory            : int  29 21 20 27 26 16 21 19 25 23 ...
##  $ LengthOfCreditHistory     : int  9 9 22 10 27 19 18 11 29 10 ...
##  $ SavingsAccountBalance     : int  7632 4627 886 1675 1555 2118 5161 781 1157 1028 ...
##  $ CheckingAccountBalance    : int  1202 3460 895 1217 4981 1223 1735 74 708 446 ...
##  $ TotalAssets               : int  146111 53204 25176 104822 244305 67914 65624 50177 29632 129664 ...
##  $ TotalLiabilities          : int  19183 9595 128874 5370 17286 40843 43894 11556 49940 12852 ...
##  $ MonthlyIncome             : num  3329 3309 3394 5757 8605 ...
##  $ UtilityBillsPaymentHistory: num  0.725 0.935 0.872 0.896 0.941 ...
##  $ JobTenure                 : int  11 3 6 5 5 5 5 5 3 3 ...
##  $ NetWorth                  : int  126928 43609 5205 99452 227019 27071 21730 38621 7711 116812 ...
##  $ BaseInterestRate          : num  0.2 0.207 0.218 0.3 0.197 ...
##  $ InterestRate              : num  0.228 0.201 0.213 0.301 0.176 ...
##  $ MonthlyLoanPayment        : num  420 794 666 1048 330 ...
##  $ TotalDebtToIncomeRatio    : num  0.1811 0.3899 0.4622 0.3131 0.0702 ...
##  $ LoanApproved              : int  0 0 0 0 1 1 0 1 0 0 ...
##  $ RiskScore                 : num  49 52 52 54 36 44 50 42.4 61 53 ...
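The structure above shows that the categorical variables (for example EmploymentStatus and EducationLevel) are stored as character vectors and that LoanApproved is an integer flag. Modelling functions usually expect factors for such variables. The sketch below shows one way the conversion could be done; `toy` is a small stand-in data frame and the labels "NotApproved" and "Approved" are assumptions, not names used elsewhere in this report.

```r
# Toy data frame standing in for `data`; only the column names come from the report
toy <- data.frame(
  EmploymentStatus = c("Employed", "Self-Employed", "Employed"),
  EducationLevel   = c("Master", "Associate", "Bachelor"),
  LoanApproved     = c(0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor
char_cols <- vapply(toy, is.character, logical(1))
toy[char_cols] <- lapply(toy[char_cols], factor)

# Treat the 0/1 target as a labelled factor for classification
toy$LoanApproved <- factor(toy$LoanApproved, levels = c(0, 1),
                           labels = c("NotApproved", "Approved"))
str(toy)
```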

# Summary statistics
summary(data)

##  ApplicationDate         Age         AnnualIncome     CreditScore   
##  Length:20000       Min.   :18.00   Min.   : 15000   Min.   :343.0  
##  Class :character   1st Qu.:32.00   1st Qu.: 31679   1st Qu.:540.0  
##  Mode  :character   Median :40.00   Median : 48566   Median :578.0  
##                     Mean   :39.75   Mean   : 59162   Mean   :571.6  
##                     3rd Qu.:48.00   3rd Qu.: 74391   3rd Qu.:609.0  
##                     Max.   :80.00   Max.   :485341   Max.   :712.0  
##  EmploymentStatus   EducationLevel       Experience      LoanAmount    
##  Length:20000       Length:20000       Min.   : 0.00   Min.   :  3674  
##  Class :character   Class :character   1st Qu.: 9.00   1st Qu.: 15575  
##  Mode  :character   Mode  :character   Median :17.00   Median : 21915  
##                                        Mean   :17.52   Mean   : 24883  
##                                        3rd Qu.:25.00   3rd Qu.: 30835  
##                                        Max.   :61.00   Max.   :184732  
##   LoanDuration    MaritalStatus      NumberOfDependents HomeOwnershipStatus
##  Min.   : 12.00   Length:20000       Min.   :0.000      Length:20000       
##  1st Qu.: 36.00   Class :character   1st Qu.:0.000      Class :character   
##  Median : 48.00   Mode  :character   Median :1.000      Mode  :character   
##  Mean   : 54.06                      Mean   :1.517                         
##  3rd Qu.: 72.00                      3rd Qu.:2.000                         
##  Max.   :120.00                      Max.   :5.000                         
##  MonthlyDebtPayments CreditCardUtilizationRate NumberOfOpenCreditLines
##  Min.   :  50.0      Min.   :0.0009745         Min.   : 0.000         
##  1st Qu.: 286.0      1st Qu.:0.1607936         1st Qu.: 2.000         
##  Median : 402.0      Median :0.2666726         Median : 3.000         
##  Mean   : 454.3      Mean   :0.2863810         Mean   : 3.023         
##  3rd Qu.: 564.0      3rd Qu.:0.3906339         3rd Qu.: 4.000         
##  Max.   :2919.0      Max.   :0.9173801         Max.   :13.000         
##  NumberOfCreditInquiries DebtToIncomeRatio BankruptcyHistory LoanPurpose       
##  Min.   :0.000           Min.   :0.00172   Min.   :0.0000    Length:20000      
##  1st Qu.:0.000           1st Qu.:0.16103   1st Qu.:0.0000    Class :character  
##  Median :1.000           Median :0.26445   Median :0.0000    Mode  :character  
##  Mean   :0.993           Mean   :0.28573   Mean   :0.0524                      
##  3rd Qu.:2.000           3rd Qu.:0.39033   3rd Qu.:0.0000                      
##  Max.   :7.000           Max.   :0.90225   Max.   :1.0000                      
##  PreviousLoanDefaults PaymentHistory  LengthOfCreditHistory
##  Min.   :0.0000       Min.   : 8.00   Min.   : 1.00        
##  1st Qu.:0.0000       1st Qu.:21.00   1st Qu.: 8.00        
##  Median :0.0000       Median :24.00   Median :15.00        
##  Mean   :0.1000       Mean   :23.99   Mean   :14.96        
##  3rd Qu.:0.0000       3rd Qu.:27.00   3rd Qu.:22.00        
##  Max.   :1.0000       Max.   :45.00   Max.   :29.00        
##  SavingsAccountBalance CheckingAccountBalance  TotalAssets     
##  Min.   :    73        Min.   :   24          Min.   :   2098  
##  1st Qu.:  1542        1st Qu.:  551          1st Qu.:  31180  
##  Median :  2986        Median : 1116          Median :  60699  
##  Mean   :  4946        Mean   : 1783          Mean   :  96964  
##  3rd Qu.:  5873        3rd Qu.: 2126          3rd Qu.: 117405  
##  Max.   :200089        Max.   :52572          Max.   :2619627  
##  TotalLiabilities  MonthlyIncome   UtilityBillsPaymentHistory   JobTenure     
##  Min.   :    372   Min.   : 1250   Min.   :0.2592             Min.   : 0.000  
##  1st Qu.:  11197   1st Qu.: 2630   1st Qu.:0.7274             1st Qu.: 3.000  
##  Median :  22203   Median : 4035   Median :0.8210             Median : 5.000  
##  Mean   :  36252   Mean   : 4892   Mean   :0.7999             Mean   : 5.003  
##  3rd Qu.:  43147   3rd Qu.: 6163   3rd Qu.:0.8923             3rd Qu.: 6.000  
##  Max.   :1417302   Max.   :25000   Max.   :0.9994             Max.   :16.000  
##     NetWorth       BaseInterestRate  InterestRate    MonthlyLoanPayment
##  Min.   :   1000   Min.   :0.1301   Min.   :0.1133   Min.   :   97.03  
##  1st Qu.:   8735   1st Qu.:0.2139   1st Qu.:0.2091   1st Qu.:  493.76  
##  Median :  32856   Median :0.2362   Median :0.2354   Median :  728.51  
##  Mean   :  72294   Mean   :0.2391   Mean   :0.2391   Mean   :  911.61  
##  3rd Qu.:  88826   3rd Qu.:0.2615   3rd Qu.:0.2655   3rd Qu.: 1112.77  
##  Max.   :2603208   Max.   :0.4050   Max.   :0.4468   Max.   :10892.63  
##  TotalDebtToIncomeRatio  LoanApproved     RiskScore    
##  Min.   :0.01604        Min.   :0.000   Min.   :28.80  
##  1st Qu.:0.17969        1st Qu.:0.000   1st Qu.:46.00  
##  Median :0.30271        Median :0.000   Median :52.00  
##  Mean   :0.40218        Mean   :0.239   Mean   :50.77  
##  3rd Qu.:0.50921        3rd Qu.:0.000   3rd Qu.:56.00  
##  Max.   :4.64766        Max.   :1.000   Max.   :84.00
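In the summary above, several monetary variables have a mean well above the median, which suggests right-skewed distributions: for AnnualIncome the ratio is 59162 / 48566, about 1.22, and for TotalAssets it is 96964 / 60699, about 1.60, whereas for Age it is close to 1. A quick way to screen all numeric columns for this pattern is sketched below on simulated data; the data frame `toy` and its values are assumptions for illustration only.

```r
# Toy values; AnnualIncome here is simulated, not taken from Loan.csv
set.seed(1)
toy <- data.frame(
  Age          = round(rnorm(500, mean = 40, sd = 10)),
  AnnualIncome = rlnorm(500, meanlog = 10.8, sdlog = 0.6)
)

# Ratio of mean to median for each numeric column; values well above 1
# hint at a long right tail
num_cols <- vapply(toy, is.numeric, logical(1))
mean_median_ratio <- vapply(toy[num_cols],
                            function(x) mean(x) / median(x),
                            numeric(1))
round(mean_median_ratio, 2)
```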

# Data quality check: get number of rows and columns
dim(data)

## [1] 20000    36

# Check whether the dataset contains any missing values
any(is.na(data))

## [1] FALSE

# Count the missing values in each column
kable(colSums(is.na(data)))

Column	Missing values
ApplicationDate	0
Age	0
AnnualIncome	0
CreditScore	0
EmploymentStatus	0
EducationLevel	0
Experience	0
LoanAmount	0
LoanDuration	0
MaritalStatus	0
NumberOfDependents	0
HomeOwnershipStatus	0
MonthlyDebtPayments	0
CreditCardUtilizationRate	0
NumberOfOpenCreditLines	0
NumberOfCreditInquiries	0
DebtToIncomeRatio	0
BankruptcyHistory	0
LoanPurpose	0
PreviousLoanDefaults	0
PaymentHistory	0
LengthOfCreditHistory	0
SavingsAccountBalance	0
CheckingAccountBalance	0
TotalAssets	0
TotalLiabilities	0
MonthlyIncome	0
UtilityBillsPaymentHistory	0
JobTenure	0
NetWorth	0
BaseInterestRate	0
InterestRate	0
MonthlyLoanPayment	0
TotalDebtToIncomeRatio	0
LoanApproved	0
RiskScore	0
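is.na() only detects values that R already treats as NA. Blank strings in character columns and fully duplicated rows would pass the check above unnoticed. The sketch below shows how these two extra checks could look; `toy` is a small made-up data frame with problems injected on purpose, so its values say nothing about Loan.csv.

```r
# Toy data frame with deliberately injected problems (illustrative values only)
toy <- data.frame(
  EmploymentStatus = c("Employed", "", "Unemployed", "Employed"),
  LoanPurpose      = c("Home", "Auto", " ", "Home"),
  Age              = c(45, 38, 47, 45),
  stringsAsFactors = FALSE
)

# Count blank or whitespace-only strings in each character column
char_cols <- vapply(toy, is.character, logical(1))
blank_counts <- vapply(toy[char_cols],
                       function(x) sum(trimws(x) == ""),
                       integer(1))
blank_counts

# Count fully duplicated rows (rows 1 and 4 are identical here)
n_duplicates <- sum(duplicated(toy))
n_duplicates
```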

Data Pre-Processing/Cleaning

The dataset includes the following columns:
1. ApplicationDate: Loan application date. Each date appears only once (one application per day), so it is converted into an application ID (AppID) and the original column is dropped.
2. Age: Applicant’s age
3. AnnualIncome: Yearly income
4. CreditScore: Creditworthiness score. The bands mirror real-world credit scoring systems such as FICO:
   * 300–579: Poor
   * 580–669: Fair
   * 670–739: Good
   * 740–799: Very Good
   * 800–850: Excellent
5. EmploymentStatus: Job situation
6. EducationLevel: Highest education attained
7. Experience: Work experience in years (to be renamed WorkExperience)
8. LoanAmount: Requested loan size
9. LoanDuration: Loan repayment period in months (values range from 12 to 120)
10. MaritalStatus: Applicant’s marital state
11. NumberOfDependents: Number of dependents
12. HomeOwnershipStatus: Homeownership type
13. MonthlyDebtPayments: Monthly debt obligations
14. CreditCardUtilizationRate: Credit card usage rate, stored as a proportion between 0 and 1
15. NumberOfOpenCreditLines: Active credit lines
16. NumberOfCreditInquiries: Credit checks count
17. DebtToIncomeRatio: Debt to income proportion
18. BankruptcyHistory: Bankruptcy records
19. LoanPurpose: Reason for loan
20. PreviousLoanDefaults: Prior loan defaults
21. PaymentHistory: Past payment behavior
22. LengthOfCreditHistory: Credit history duration
23. SavingsAccountBalance: Savings account amount
24. CheckingAccountBalance: Checking account funds
25. TotalAssets: Total owned assets
26. TotalLiabilities: Total owed debts
27. MonthlyIncome: Income per month
28. UtilityBillsPaymentHistory: Utility payment record
29. JobTenure: Job duration
30. NetWorth: Total financial worth
31. BaseInterestRate: Starting interest rate
32. InterestRate: Applied interest rate
33. MonthlyLoanPayment: Monthly loan payment
34. TotalDebtToIncomeRatio: Total debt against income
35. LoanApproved: Loan approval status
36. RiskScore: Risk assessment score. A higher RiskScore indicates higher risk.
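The list above notes that Experience is to be renamed WorkExperience. A minimal base R sketch of such a rename is given below; `toy` is a stand-in data frame, and only the two column names come from this report.

```r
# Toy data frame; only the column name Experience comes from the dataset
toy <- data.frame(Age = c(45, 38), Experience = c(22, 15))

# Rename Experience to WorkExperience without touching other columns
names(toy)[names(toy) == "Experience"] <- "WorkExperience"
names(toy)
```

With dplyr loaded, the same result could be obtained with rename(WorkExperience = Experience).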

# 1. Distribution of LoanApproved
ggplot(data, aes(x = factor(LoanApproved))) +
  geom_bar() +
  # after_stat(count) replaces the ..count.. notation deprecated in ggplot2 3.4.0
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "LoanApproved Distribution",
       x = "LoanApproved",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.5))
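The summary statistics report a mean of 0.239 for LoanApproved, so roughly 23.9% of applications are approved and the classes are imbalanced. This matters for the accuracy target: a model that always predicts "not approved" would already be right about 76% of the time. The sketch below reproduces that arithmetic on a toy vector built to match the reported rate; the vector `loan_approved` is an assumption and is not read from Loan.csv.

```r
# Toy outcome vector built to match the reported approval rate of 0.239
loan_approved <- rep(c(0, 1), times = c(761, 239))

# Share of each class
class_share <- prop.table(table(loan_approved))
class_share

# Accuracy of a baseline that always predicts the majority class
baseline_accuracy <- max(class_share)
baseline_accuracy
```

Because of this baseline, metrics such as precision and recall for the approved class may be more informative than accuracy alone.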

# Data cleaning
# 1. Convert ApplicationDate into an application ID (AppID), then drop ApplicationDate
# 2. Create a new variable, CreditScoreGroup, that follows the FICO bands

data$AppID <- gsub("-", "", data$ApplicationDate)
data$ApplicationDate <- NULL
head(data)

##   Age AnnualIncome CreditScore EmploymentStatus EducationLevel Experience
## 1  45        39948         617         Employed         Master         22
## 2  38        39709         628         Employed      Associate         15
## 3  47        40724         570         Employed       Bachelor         26
## 4  58        69084         545         Employed    High School         34
## 5  37       103264         594         Employed      Associate         17
## 6  37       178310         626    Self-Employed         Master         16
##   LoanAmount LoanDuration MaritalStatus NumberOfDependents HomeOwnershipStatus
## 1      13152           48       Married                  2                 Own
## 2      26045           48        Single                  1            Mortgage
## 3      17627           36       Married                  2                Rent
## 4      37898           96        Single                  1            Mortgage
## 5       9184           36       Married                  1            Mortgage
## 6      15433           72       Married                  0                Rent
##   MonthlyDebtPayments CreditCardUtilizationRate NumberOfOpenCreditLines
## 1                 183                0.35441792                       1
## 2                 496                0.08782697                       5
## 3                 902                0.13741410                       2
## 4                 755                0.26758714                       2
## 5                 274                0.32053532                       0
## 6                 732                0.10221134                       5
##   NumberOfCreditInquiries DebtToIncomeRatio BankruptcyHistory
## 1                       2        0.35833560                 0
## 2                       3        0.33027367                 0
## 3                       0        0.24472911                 0
## 4                       1        0.43624427                 0
## 5                       0        0.07888421                 0
## 6                       1        0.25936640                 0
##          LoanPurpose PreviousLoanDefaults PaymentHistory LengthOfCreditHistory
## 1               Home                    0             29                     9
## 2 Debt Consolidation                    0             21                     9
## 3          Education                    0             20                    22
## 4               Home                    0             27                    10
## 5 Debt Consolidation                    0             26                    27
## 6 Debt Consolidation                    1             16                    19
##   SavingsAccountBalance CheckingAccountBalance TotalAssets TotalLiabilities
## 1                  7632                   1202      146111            19183
## 2                  4627                   3460       53204             9595
## 3                   886                    895       25176           128874
## 4                  1675                   1217      104822             5370
## 5                  1555                   4981      244305            17286
## 6                  2118                   1223       67914            40843
##   MonthlyIncome UtilityBillsPaymentHistory JobTenure NetWorth BaseInterestRate
## 1      3329.000                  0.7249720        11   126928         0.199652
## 2      3309.083                  0.9351321         3    43609         0.207045
## 3      3393.667                  0.8722406         6     5205         0.217627
## 4      5757.000                  0.8961547         5    99452         0.300398
## 5      8605.333                  0.9413687         5   227019         0.197184
## 6     14859.167                  0.7560794         5    27071         0.217433
##   InterestRate MonthlyLoanPayment TotalDebtToIncomeRatio LoanApproved RiskScore
## 1    0.2275896           419.8060             0.18107720            0        49
## 2    0.2010771           794.0542             0.38985245            0        52
## 3    0.2125480           666.4067             0.46215697            0        52
## 4    0.3009108          1047.5070             0.31309831            0        54
## 5    0.1759902           330.1791             0.07020985            1        36
## 6    0.2176012           385.5771             0.07521129            1        44
##      AppID
## 1 20180101
## 2 20180102
## 3 20180103
## 4 20180104
## 5 20180105
## 6 20180106
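Using the date as an identifier only works if every date occurs exactly once. The sketch below shows how that assumption could be verified, and how parsing the string as a date first would also catch malformed entries. The vector `application_date` holds three illustrative values and is not read from Loan.csv.

```r
# Toy dates in the same format as ApplicationDate (illustrative values)
application_date <- c("2018-01-01", "2018-01-02", "2018-01-03")

# Same transformation as in the report: strip the hyphens
app_id <- gsub("-", "", application_date)

# Parsing as a date first also validates the format; malformed dates become NA
app_id_checked <- format(as.Date(application_date, format = "%Y-%m-%d"), "%Y%m%d")
stopifnot(identical(app_id, app_id_checked))

# An identifier should be unique: anyDuplicated() returns 0 when no value repeats
anyDuplicated(app_id) == 0
```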

#Mirrors real-world credit scoring systems like FICO, where:
#300–579: Poor
#580–669: Fair
#670–739: Good
#740–799: Very Good
#800–850: Excellent

data <- data %>%
  mutate(CreditScoreGroup = case_when(
    CreditScore >= 300 & CreditScore <= 579 ~ "Poor",
    CreditScore >= 580 & CreditScore <= 669 ~ "Fair",
    CreditScore >= 670 & CreditScore <= 739 ~ "Good",
    CreditScore >= 740 & CreditScore <= 799 ~ "Very Good",
    CreditScore >= 800 & CreditScore <= 850 ~ "Excellent",
    TRUE ~ NA_character_ #for other cases
  ))

head(data)

##   Age AnnualIncome CreditScore EmploymentStatus EducationLevel Experience
## 1  45        39948         617         Employed         Master         22
## 2  38        39709         628         Employed      Associate         15
## 3  47        40724         570         Employed       Bachelor         26
## 4  58        69084         545         Employed    High School         34
## 5  37       103264         594         Employed      Associate         17
## 6  37       178310         626    Self-Employed         Master         16
##   LoanAmount LoanDuration MaritalStatus NumberOfDependents HomeOwnershipStatus
## 1      13152           48       Married                  2                 Own
## 2      26045           48        Single                  1            Mortgage
## 3      17627           36       Married                  2                Rent
## 4      37898           96        Single                  1            Mortgage
## 5       9184           36       Married                  1            Mortgage
## 6      15433           72       Married                  0                Rent
##   MonthlyDebtPayments CreditCardUtilizationRate NumberOfOpenCreditLines
## 1                 183                0.35441792                       1
## 2                 496                0.08782697                       5
## 3                 902                0.13741410                       2
## 4                 755                0.26758714                       2
## 5                 274                0.32053532                       0
## 6                 732                0.10221134                       5
##   NumberOfCreditInquiries DebtToIncomeRatio BankruptcyHistory
## 1                       2        0.35833560                 0
## 2                       3        0.33027367                 0
## 3                       0        0.24472911                 0
## 4                       1        0.43624427                 0
## 5                       0        0.07888421                 0
## 6                       1        0.25936640                 0
##          LoanPurpose PreviousLoanDefaults PaymentHistory LengthOfCreditHistory
## 1               Home                    0             29                     9
## 2 Debt Consolidation                    0             21                     9
## 3          Education                    0             20                    22
## 4               Home                    0             27                    10
## 5 Debt Consolidation                    0             26                    27
## 6 Debt Consolidation                    1             16                    19
##   SavingsAccountBalance CheckingAccountBalance TotalAssets TotalLiabilities
## 1                  7632                   1202      146111            19183
## 2                  4627                   3460       53204             9595
## 3                   886                    895       25176           128874
## 4                  1675                   1217      104822             5370
## 5                  1555                   4981      244305            17286
## 6                  2118                   1223       67914            40843
##   MonthlyIncome UtilityBillsPaymentHistory JobTenure NetWorth BaseInterestRate
## 1      3329.000                  0.7249720        11   126928         0.199652
## 2      3309.083                  0.9351321         3    43609         0.207045
## 3      3393.667                  0.8722406         6     5205         0.217627
## 4      5757.000                  0.8961547         5    99452         0.300398
## 5      8605.333                  0.9413687         5   227019         0.197184
## 6     14859.167                  0.7560794         5    27071         0.217433
##   InterestRate MonthlyLoanPayment TotalDebtToIncomeRatio LoanApproved RiskScore
## 1    0.2275896           419.8060             0.18107720            0        49
## 2    0.2010771           794.0542             0.38985245            0        52
## 3    0.2125480           666.4067             0.46215697            0        52
## 4    0.3009108          1047.5070             0.31309831            0        54
## 5    0.1759902           330.1791             0.07020985            1        36
## 6    0.2176012           385.5771             0.07521129            1        44
##      AppID CreditScoreGroup
## 1 20180101             Fair
## 2 20180102             Fair
## 3 20180103             Poor
## 4 20180104             Poor
## 5 20180105             Fair
## 6 20180106             Fair

unique_values <- unique(data$CreditScoreGroup)
unique_values

## [1] "Fair" "Poor" "Good"
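Only three groups appear because CreditScore ranges from 343 to 712 in this dataset, so the "Very Good" and "Excellent" bands are empty. As an alternative to case_when(), the same banding could be expressed with base R's cut(), which returns an ordered factor that keeps all five levels. The sketch below uses a hand-picked vector `credit_score` (an assumption for illustration) that covers the band edges and one out-of-range value.

```r
# Illustrative scores covering band edges and one out-of-range value
credit_score <- c(343, 579, 580, 617, 669, 670, 712, 800, 900)

# right = TRUE makes each interval closed on the right: (299, 579], (579, 669], ...
credit_score_group <- cut(
  credit_score,
  breaks = c(299, 579, 669, 739, 799, 850),
  labels = c("Poor", "Fair", "Good", "Very Good", "Excellent"),
  right = TRUE,
  ordered_result = TRUE
)

# table() lists all five levels, including bands with zero observations
table(credit_score_group, useNA = "ifany")
```

Scores outside 300–850 become NA, matching the fallback branch of the case_when() version. An ordered factor also lets models and plots respect the Poor-to-Excellent ordering.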

#Rearrange columns
data <- data %>%
  select(AppID, Age, AnnualIncome,  CreditScore, CreditScoreGroup, everything())
head(data)

##      AppID Age AnnualIncome CreditScore CreditScoreGroup EmploymentStatus
## 1 20180101  45        39948         617             Fair         Employed
## 2 20180102  38        39709         628             Fair         Employed
## 3 20180103  47        40724         570             Poor         Employed
## 4 20180104  58        69084         545             Poor         Employed
## 5 20180105  37       103264         594             Fair         Employed
## 6 20180106  37       178310         626             Fair    Self-Employed
##   EducationLevel Experience LoanAmount LoanDuration MaritalStatus
## 1         Master         22      13152           48       Married
## 2      Associate         15      26045           48        Single
## 3       Bachelor         26      17627           36       Married
## 4    High School         34      37898           96        Single
## 5      Associate         17       9184           36       Married
## 6         Master         16      15433           72       Married
##   NumberOfDependents HomeOwnershipStatus MonthlyDebtPayments
## 1                  2                 Own                 183
## 2                  1            Mortgage                 496
## 3                  2                Rent                 902
## 4                  1            Mortgage                 755
## 5                  1            Mortgage                 274
## 6                  0                Rent                 732
##   CreditCardUtilizationRate NumberOfOpenCreditLines NumberOfCreditInquiries
## 1                0.35441792                       1                       2
## 2                0.08782697                       5                       3
## 3                0.13741410                       2                       0
## 4                0.26758714                       2                       1
## 5                0.32053532                       0                       0
## 6                0.10221134                       5                       1
##   DebtToIncomeRatio BankruptcyHistory        LoanPurpose PreviousLoanDefaults
## 1        0.35833560                 0               Home                    0
## 2        0.33027367                 0 Debt Consolidation                    0
## 3        0.24472911                 0          Education                    0
## 4        0.43624427                 0               Home                    0
## 5        0.07888421                 0 Debt Consolidation                    0
## 6        0.25936640                 0 Debt Consolidation                    1
##   PaymentHistory LengthOfCreditHistory SavingsAccountBalance
## 1             29                     9                  7632
## 2             21                     9                  4627
## 3             20                    22                   886
## 4             27                    10                  1675
## 5             26                    27                  1555
## 6             16                    19                  2118
##   CheckingAccountBalance TotalAssets TotalLiabilities MonthlyIncome
## 1                   1202      146111            19183      3329.000
## 2                   3460       53204             9595      3309.083
## 3                    895       25176           128874      3393.667
## 4                   1217      104822             5370      5757.000
## 5                   4981      244305            17286      8605.333
## 6                   1223       67914            40843     14859.167
##   UtilityBillsPaymentHistory JobTenure NetWorth BaseInterestRate InterestRate
## 1                  0.7249720        11   126928         0.199652    0.2275896
## 2                  0.9351321         3    43609         0.207045    0.2010771
## 3                  0.8722406         6     5205         0.217627    0.2125480
## 4                  0.8961547         5    99452         0.300398    0.3009108
## 5                  0.9413687         5   227019         0.197184    0.1759902
## 6                  0.7560794         5    27071         0.217433    0.2176012
##   MonthlyLoanPayment TotalDebtToIncomeRatio LoanApproved RiskScore
## 1           419.8060             0.18107720            0        49
## 2           794.0542             0.38985245            0        52
## 3           666.4067             0.46215697            0        52
## 4          1047.5070             0.31309831            0        54
## 5           330.1791             0.07020985            1        36
## 6           385.5771             0.07521129            1        44

Exploratory Data Analysis (EDA)

# Inspect RiskScore; it could be binned into groups if a classification target is needed
summary(data$RiskScore)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.80   46.00   52.00   50.77   56.00   84.00

hist(data$RiskScore, breaks = 10, main = "RiskScore Distribution")

#  Distribution of RiskScore

ggplot(data, aes(x = RiskScore)) +
  geom_histogram(binwidth =  2, color = "black", alpha = 0.9) +
  labs(title = "Histogram of RiskScore",
        x = "Risk Score",
        y = "Count") +
  theme_minimal()

# Select numeric columns only
numeric_cols <- data %>% select(where(is.numeric))

# Compute correlation matrix (use complete cases to avoid NAs)
cor_matrix <- cor(numeric_cols, use = "complete.obs")

# Visualize correlation heatmap
corrplot(cor_matrix, method = "color", type = "lower",
         tl.cex = 0.5, tl.col = "black",
         addCoef.col = "black", number.cex = 0.4,
         title = "Correlation Heatmap with LoanApproved",
         mar = c(0,0,1,0))

Bivariate Categorical Analysis:

Loan approval vs. credit score group

ggplot(data, aes(x = CreditScoreGroup, fill = factor(LoanApproved))) +
  geom_bar(position = "fill") +
  labs(title = "Loan Approval by Credit Score Group",
       y = "Proportion", x = "Credit Score Group") +
  theme_minimal()

Relationship Between Income, Risk, and Loan Approval:

AnnualIncome vs LoanApproved

ggplot(data, aes(x = factor(LoanApproved), y = AnnualIncome)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Annual Income by Loan Approval", x = "Loan Approved", y = "Annual Income")

Relationship Between Income, Risk, and Loan Approval:

RiskScore vs. CreditScore, colored by LoanApproved

ggplot(data, aes(x = CreditScore, y = RiskScore, color = factor(LoanApproved))) +
  geom_point(alpha = 0.6) +
  labs(title = "Credit Score vs. Risk Score by Loan Approval", x = "Credit Score", y = "Risk Score") +
  theme_minimal()

Explore Potential Predictive Features

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

featurePlot(x = data[, c("Age", "AnnualIncome", "CreditScore", "RiskScore")],
            y = as.factor(data$LoanApproved),
            plot = "density",
            auto.key = list(columns = 2))

Data Modeling (Machine learning)

To evaluate the models' performance, we split the dataset into two parts:

80% for Training - used to fit the models.
20% for Testing - used to evaluate the models on unseen data.

An 80/20 split is a common practice in machine learning. Holding out a test set lets us check how well a model generalizes to unseen data and helps reveal overfitting.

For Binary Classification - We use LoanApproved as target

For Regression - We use RiskScore as target

#Convert LoanApproved to factor. (To treat the numeric values as a binary categorical outcome)
data$LoanApproved <- factor(data$LoanApproved, labels = c("Loan Not Approved", "Loan Approved"))
#Original dataset
table(data$LoanApproved)

## 
## Loan Not Approved     Loan Approved 
##             15220              4780

#Split the dataset to Training and Testing
# Install caTools only if it is not already available
if (!requireNamespace("caTools", quietly = TRUE)) install.packages("caTools")
library(caTools)

Binary Classification

set.seed(123) # for reproducibility: repeated runs produce the same split
split <- sample.split(data$LoanApproved, SplitRatio = 0.8)
train_data <- subset(data, split == TRUE)
test_data <- subset(data, split == FALSE)
kable(summary(train_data$LoanApproved))

	x
Loan Not Approved	12176
Loan Approved	3824

kable(summary(test_data$LoanApproved))

	x
Loan Not Approved	3044
Loan Approved	956

# Standardize the numeric features used by the binary classification models;
# the training and test sets are each scaled with their own means and standard deviations
features_to_scale <- c( "LoanAmount", "TotalAssets", "MonthlyIncome", "InterestRate", "TotalDebtToIncomeRatio")
train_data_scaled <- train_data
train_data_scaled[features_to_scale] <- scale(train_data[features_to_scale])

test_data_scaled <- test_data
test_data_scaled[features_to_scale] <- scale(test_data[features_to_scale])
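
The chunk above scales the test set with its own means and standard deviations. A common alternative, sketched here on made-up toy data rather than the loan dataset, reuses the training set's parameters so that both sets share one scale and no test-set statistics enter preprocessing. The data frames train_toy and test_toy are hypothetical and exist only for this illustration.

```r
# Toy data standing in for the real train/test sets (hypothetical values)
train_toy <- data.frame(LoanAmount    = c(10000, 20000, 30000),
                        MonthlyIncome = c(3000, 5000, 7000))
test_toy  <- data.frame(LoanAmount    = c(15000, 25000),
                        MonthlyIncome = c(4000, 9000))

# Learn the scaling parameters from the training set only
train_scaled <- scale(train_toy)
centers <- attr(train_scaled, "scaled:center")  # training means
sds     <- attr(train_scaled, "scaled:scale")   # training standard deviations

# Apply the training means and standard deviations to the test set
test_scaled <- scale(test_toy, center = centers, scale = sds)
test_scaled
```

With this approach a test value equal to the training mean maps to 0, whatever the test set's own distribution looks like.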

Model 1: Logistic Regression

We use Logistic Regression to model the probability that a loan is approved because the target variable, LoanApproved, is binary, with only two possible outcomes:

1 = Approved
0 = Not Approved

Logistic Regression is a classification algorithm that models the relationship between a set of input variables (features) and a binary outcome. It predicts the probability of the loan being approved given the value of the input variables.
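
In symbols, the model can be written as follows. The notation is ours and not part of the original analysis: x_1, ..., x_p stand for the input features and the beta terms for the coefficients that glm estimates.

```latex
P(\text{LoanApproved} = 1 \mid x_1, \dots, x_p)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}
\qquad\Longleftrightarrow\qquad
\log\frac{P}{1 - P} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
```

A positive coefficient therefore raises the log-odds of approval as the feature increases, and a negative coefficient lowers them.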

Target variable:

LoanApproved

Features used in the model:

LoanAmount
TotalAssets
MonthlyIncome
InterestRate
TotalDebtToIncomeRatio

These features were selected based on their correlation with the target variable, as shown in the heatmap analysis.

#  model 1 for binary classification
model_class_1 <- glm(LoanApproved ~ LoanAmount + TotalAssets + MonthlyIncome +
                            InterestRate + TotalDebtToIncomeRatio,
                      data = train_data_scaled, family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model_class_1)

## 
## Call:
## glm(formula = LoanApproved ~ LoanAmount + TotalAssets + MonthlyIncome + 
##     InterestRate + TotalDebtToIncomeRatio, family = binomial, 
##     data = train_data_scaled)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -4.42235    0.10676  -41.42   <2e-16 ***
## LoanAmount             -1.11394    0.06856  -16.25   <2e-16 ***
## TotalAssets             1.31002    0.04042   32.41   <2e-16 ***
## MonthlyIncome           2.50456    0.08074   31.02   <2e-16 ***
## InterestRate           -1.87860    0.05237  -35.87   <2e-16 ***
## TotalDebtToIncomeRatio -3.52042    0.17660  -19.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17597.6  on 15999  degrees of freedom
## Residual deviance:  5844.3  on 15994  degrees of freedom
## AIC: 5856.3
## 
## Number of Fisher Scoring iterations: 8

# Predict probabilities
pred_probs <- predict(model_class_1, newdata = test_data_scaled, type = "response")

# Predict classes based on a threshold (e.g., 0.5)
pred_classes <- ifelse(pred_probs > 0.5,"Loan Approved","Loan Not Approved" )
pred_classes <- factor(pred_classes, levels = levels(test_data_scaled$LoanApproved))

# Create a confusion matrix
confusion_matrx_1 <- confusionMatrix(factor(pred_classes), factor(test_data_scaled$LoanApproved))
confusion_matrx_1

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Loan Not Approved Loan Approved
##   Loan Not Approved              2912           178
##   Loan Approved                   132           778
##                                            
##                Accuracy : 0.9225           
##                  95% CI : (0.9138, 0.9306) 
##     No Information Rate : 0.761            
##     P-Value [Acc > NIR] : < 2e-16          
##                                            
##                   Kappa : 0.7834           
##                                            
##  Mcnemar's Test P-Value : 0.01059          
##                                            
##             Sensitivity : 0.9566           
##             Specificity : 0.8138           
##          Pos Pred Value : 0.9424           
##          Neg Pred Value : 0.8549           
##              Prevalence : 0.7610           
##          Detection Rate : 0.7280           
##    Detection Prevalence : 0.7725           
##       Balanced Accuracy : 0.8852           
##                                            
##        'Positive' Class : Loan Not Approved
##
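
The headline statistics in the output above follow directly from the four cell counts. The sketch below recomputes them in base R, assuming caret's default of treating the first factor level (Loan Not Approved) as the positive class, which the output confirms. The variable names are ours.

```r
# Counts copied from the logistic regression confusion matrix above.
# Positive class = "Loan Not Approved".
tp <- 2912  # predicted Not Approved, actually Not Approved
fn <- 132   # predicted Approved,     actually Not Approved
fp <- 178   # predicted Not Approved, actually Approved
tn <- 778   # predicted Approved,     actually Approved

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # recall for the positive class
specificity <- tn / (tn + fp)   # share of approved loans predicted correctly
precision   <- tp / (tp + fp)   # "Pos Pred Value" in the caret output
balanced    <- (sensitivity + specificity) / 2

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, Precision = precision,
        BalancedAccuracy = balanced), 4)
```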

# Install pROC only if it is not already available
if (!requireNamespace("pROC", quietly = TRUE)) install.packages("pROC")

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# Reuse the test-set probabilities (pred_probs) computed above

# Generate ROC curve: levels = c(control, case), so "Loan Approved" is the case;
# direction = "<" states that cases have higher predicted probabilities than controls
roc_obj <- roc(test_data_scaled$LoanApproved, pred_probs,
               levels = c("Loan Not Approved", "Loan Approved"), direction = "<")

# Plot ROC
plot(roc_obj, col = "blue", main = "ROC Curve")

# AUC
auc_value <- auc(roc_obj)
print(paste("AUC:", round(auc_value, 3)))

## [1] "AUC: 0.974"
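
AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below illustrates this with a rank-based formula on made-up scores. The helper auc_rank and the toy vectors are hypothetical and are not how pROC is implemented internally.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic divided by n_pos * n_neg)
auc_rank <- function(scores, is_positive) {
  r     <- rank(scores)            # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy predicted probabilities and true labels (hypothetical values)
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

toy_auc <- auc_rank(scores, labels)
toy_auc  # 8 of the 9 positive/negative pairs are ordered correctly
```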

Model 2: Decision Tree

We use a Decision Tree as a second classifier for loan approval, for comparison with Logistic Regression. The target variable, LoanApproved, is binary, with only two possible outcomes:

1 = Approved
0 = Not Approved

Decision Tree is a classification algorithm that models the relationship between a set of input variables (features) and a categorical outcome, such as loan approval status. It splits the data into branches based on feature values, creating a tree-like structure that leads to a decision. In binary classification (e.g., “Loan Approved” vs. “Loan Not Approved”), it learns patterns in the features to predict the class label, not just a probability.

Target variable:

LoanApproved

Features used in the model:

LoanAmount
TotalAssets
MonthlyIncome
InterestRate
TotalDebtToIncomeRatio

These features were selected based on their correlation with the target variable, as shown in the heatmap analysis.

# Install the decision tree packages only if they are not already available
if (!requireNamespace("rpart", quietly = TRUE)) install.packages("rpart")
if (!requireNamespace("rpart.plot", quietly = TRUE)) install.packages("rpart.plot")  # Optional: for visualization

library(rpart)
library(rpart.plot)

#  model 2 for binary classification
model_class_2 <- rpart(LoanApproved ~ LoanAmount + TotalAssets + MonthlyIncome +
                            InterestRate + TotalDebtToIncomeRatio,
                      data = train_data_scaled, method = "class")

summary(model_class_2)

## Call:
## rpart(formula = LoanApproved ~ LoanAmount + TotalAssets + MonthlyIncome + 
##     InterestRate + TotalDebtToIncomeRatio, data = train_data_scaled, 
##     method = "class")
##   n= 16000 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.37709205      0 1.0000000 1.0000000 0.01410696
## 2 0.04589435      1 0.6229079 0.6555962 0.01202413
## 3 0.01908996      3 0.5311192 0.5462866 0.01114475
## 4 0.01725941      4 0.5120293 0.5431485 0.01111749
## 5 0.01542887      7 0.4602510 0.5109833 0.01083081
## 6 0.01000000      8 0.4448222 0.4864017 0.01060238
## 
## Variable importance
## TotalDebtToIncomeRatio          MonthlyIncome           InterestRate 
##                     57                     28                     10 
##             LoanAmount            TotalAssets 
##                      3                      2 
## 
## Node number 1: 16000 observations,    complexity param=0.3770921
##   predicted class=Loan Not Approved  expected loss=0.239  P(node) =1
##     class counts: 12176  3824
##    probabilities: 0.761 0.239 
##   left son=2 (12178 obs) right son=3 (3822 obs)
##   Primary splits:
##       TotalDebtToIncomeRatio < -0.6732645  to the right, improve=2030.5030, (0 missing)
##       MonthlyIncome          < 0.4152906   to the left,  improve=1788.9430, (0 missing)
##       InterestRate           < -0.2290712  to the right, improve= 422.4311, (0 missing)
##       LoanAmount             < -0.3905908  to the right, improve= 274.7782, (0 missing)
##       TotalAssets            < 2.107242    to the left,  improve= 156.7886, (0 missing)
##   Surrogate splits:
##       MonthlyIncome < 0.5842954   to the left,  agree=0.859, adj=0.412, (0 split)
##       LoanAmount    < -1.350056   to the right, agree=0.763, adj=0.006, (0 split)
## 
## Node number 2: 12178 observations,    complexity param=0.01725941
##   predicted class=Loan Not Approved  expected loss=0.09788143  P(node) =0.761125
##     class counts: 10986  1192
##    probabilities: 0.902 0.098 
##   left son=4 (8546 obs) right son=5 (3632 obs)
##   Primary splits:
##       TotalDebtToIncomeRatio < -0.354714   to the right, improve=251.81920, (0 missing)
##       MonthlyIncome          < 0.0375905   to the left,  improve=211.73390, (0 missing)
##       InterestRate           < -0.8416135  to the right, improve=149.75920, (0 missing)
##       TotalAssets            < 2.129869    to the left,  improve=146.61160, (0 missing)
##       LoanAmount             < 0.4568591   to the right, improve= 31.43389, (0 missing)
##   Surrogate splits:
##       MonthlyIncome < -0.07292332 to the left,  agree=0.770, adj=0.230, (0 split)
##       LoanAmount    < -1.164882   to the right, agree=0.705, adj=0.011, (0 split)
##       TotalAssets   < -0.7779875  to the right, agree=0.702, adj=0.000, (0 split)
## 
## Node number 3: 3822 observations,    complexity param=0.04589435
##   predicted class=Loan Approved      expected loss=0.3113553  P(node) =0.238875
##     class counts:  1190  2632
##    probabilities: 0.311 0.689 
##   left son=6 (1846 obs) right son=7 (1976 obs)
##   Primary splits:
##       InterestRate           < -0.2238504  to the right, improve=210.898100, (0 missing)
##       MonthlyIncome          < 0.5706794   to the left,  improve=203.141600, (0 missing)
##       TotalDebtToIncomeRatio < -0.8485246  to the right, improve=177.069000, (0 missing)
##       TotalAssets            < 0.2208439   to the left,  improve= 47.521030, (0 missing)
##       LoanAmount             < -0.6595621  to the right, improve=  5.054868, (0 missing)
##   Surrogate splits:
##       LoanAmount             < -0.4034309  to the right, agree=0.574, adj=0.118, (0 split)
##       TotalAssets            < 1.816473    to the right, agree=0.523, adj=0.012, (0 split)
##       TotalDebtToIncomeRatio < -0.8171796  to the right, agree=0.521, adj=0.008, (0 split)
##       MonthlyIncome          < 3.770106    to the right, agree=0.519, adj=0.004, (0 split)
## 
## Node number 4: 8546 observations
##   predicted class=Loan Not Approved  expected loss=0.03159373  P(node) =0.534125
##     class counts:  8276   270
##    probabilities: 0.968 0.032 
## 
## Node number 5: 3632 observations,    complexity param=0.01725941
##   predicted class=Loan Not Approved  expected loss=0.2538546  P(node) =0.227
##     class counts:  2710   922
##    probabilities: 0.746 0.254 
##   left son=10 (2475 obs) right son=11 (1157 obs)
##   Primary splits:
##       InterestRate           < -0.5882456  to the right, improve=203.577800, (0 missing)
##       TotalAssets            < 1.047169    to the left,  improve=101.485800, (0 missing)
##       MonthlyIncome          < 0.1459273   to the left,  improve= 78.070170, (0 missing)
##       TotalDebtToIncomeRatio < -0.5248973  to the right, improve= 37.063880, (0 missing)
##       LoanAmount             < 0.4573817   to the right, improve=  4.511907, (0 missing)
##   Surrogate splits:
##       LoanAmount             < -1.339717   to the right, agree=0.683, adj=0.006, (0 split)
##       TotalDebtToIncomeRatio < -0.3553237  to the left,  agree=0.683, adj=0.004, (0 split)
##       MonthlyIncome          < 2.418231    to the left,  agree=0.682, adj=0.003, (0 split)
##       TotalAssets            < 6.512311    to the left,  agree=0.682, adj=0.002, (0 split)
## 
## Node number 6: 1846 observations,    complexity param=0.04589435
##   predicted class=Loan Approved      expected loss=0.4832069  P(node) =0.115375
##     class counts:   892   954
##    probabilities: 0.483 0.517 
##   left son=12 (1131 obs) right son=13 (715 obs)
##   Primary splits:
##       MonthlyIncome          < 1.174252    to the left,  improve=172.703400, (0 missing)
##       TotalDebtToIncomeRatio < -0.8551     to the right, improve=156.231700, (0 missing)
##       InterestRate           < 0.7787038   to the right, improve= 54.585080, (0 missing)
##       TotalAssets            < 0.2214176   to the left,  improve= 49.830070, (0 missing)
##       LoanAmount             < -0.1487182  to the left,  improve=  5.963172, (0 missing)
##   Surrogate splits:
##       TotalDebtToIncomeRatio < -0.8727742  to the right, agree=0.736, adj=0.317, (0 split)
##       LoanAmount             < 0.0217124   to the left,  agree=0.719, adj=0.274, (0 split)
##       InterestRate           < 3.048491    to the left,  agree=0.615, adj=0.007, (0 split)
## 
## Node number 7: 1976 observations
##   predicted class=Loan Approved      expected loss=0.1508097  P(node) =0.1235
##     class counts:   298  1678
##    probabilities: 0.151 0.849 
## 
## Node number 10: 2475 observations
##   predicted class=Loan Not Approved  expected loss=0.1393939  P(node) =0.1546875
##     class counts:  2130   345
##    probabilities: 0.861 0.139 
## 
## Node number 11: 1157 observations,    complexity param=0.01725941
##   predicted class=Loan Not Approved  expected loss=0.4987035  P(node) =0.0723125
##     class counts:   580   577
##    probabilities: 0.501 0.499 
##   left son=22 (759 obs) right son=23 (398 obs)
##   Primary splits:
##       MonthlyIncome          < 0.2026565   to the left,  improve=75.861990, (0 missing)
##       TotalAssets            < 0.3525364   to the left,  improve=44.057170, (0 missing)
##       InterestRate           < -1.307246   to the right, improve=38.160860, (0 missing)
##       TotalDebtToIncomeRatio < -0.58997    to the right, improve=21.561770, (0 missing)
##       LoanAmount             < -0.01348603 to the left,  improve= 5.874888, (0 missing)
##   Surrogate splits:
##       LoanAmount             < -0.06171124 to the left,  agree=0.774, adj=0.342, (0 split)
##       TotalAssets            < -0.7451932  to the right, agree=0.658, adj=0.005, (0 split)
##       InterestRate           < -1.937091   to the right, agree=0.658, adj=0.005, (0 split)
##       TotalDebtToIncomeRatio < -0.643018   to the right, agree=0.657, adj=0.003, (0 split)
## 
## Node number 12: 1131 observations,    complexity param=0.01908996
##   predicted class=Loan Not Approved  expected loss=0.3448276  P(node) =0.0706875
##     class counts:   741   390
##    probabilities: 0.655 0.345 
##   left son=24 (1020 obs) right son=25 (111 obs)
##   Primary splits:
##       TotalAssets            < 1.058802    to the left,  improve=57.664480, (0 missing)
##       MonthlyIncome          < 0.572567    to the left,  improve=38.278080, (0 missing)
##       InterestRate           < 0.7787038   to the right, improve=32.074390, (0 missing)
##       TotalDebtToIncomeRatio < -0.8485246  to the right, improve=29.997210, (0 missing)
##       LoanAmount             < -0.4383308  to the right, improve= 8.398313, (0 missing)
##   Surrogate splits:
##       MonthlyIncome < -0.7813846  to the right, agree=0.904, adj=0.018, (0 split)
## 
## Node number 13: 715 observations
##   predicted class=Loan Approved      expected loss=0.2111888  P(node) =0.0446875
##     class counts:   151   564
##    probabilities: 0.211 0.789 
## 
## Node number 22: 759 observations,    complexity param=0.01542887
##   predicted class=Loan Not Approved  expected loss=0.3675889  P(node) =0.0474375
##     class counts:   480   279
##    probabilities: 0.632 0.368 
##   left son=44 (616 obs) right son=45 (143 obs)
##   Primary splits:
##       TotalAssets            < 0.3154161   to the left,  improve=40.426830, (0 missing)
##       InterestRate           < -1.199029   to the right, improve=32.862340, (0 missing)
##       MonthlyIncome          < -0.2135006  to the left,  improve=12.154800, (0 missing)
##       TotalDebtToIncomeRatio < -0.4056125  to the right, improve=10.312910, (0 missing)
##       LoanAmount             < -0.6725888  to the right, improve= 3.949578, (0 missing)
## 
## Node number 23: 398 observations
##   predicted class=Loan Approved      expected loss=0.2512563  P(node) =0.024875
##     class counts:   100   298
##    probabilities: 0.251 0.749 
## 
## Node number 24: 1020 observations
##   predicted class=Loan Not Approved  expected loss=0.2921569  P(node) =0.06375
##     class counts:   722   298
##    probabilities: 0.708 0.292 
## 
## Node number 25: 111 observations
##   predicted class=Loan Approved      expected loss=0.1711712  P(node) =0.0069375
##     class counts:    19    92
##    probabilities: 0.171 0.829 
## 
## Node number 44: 616 observations
##   predicted class=Loan Not Approved  expected loss=0.288961  P(node) =0.0385
##     class counts:   438   178
##    probabilities: 0.711 0.289 
## 
## Node number 45: 143 observations
##   predicted class=Loan Approved      expected loss=0.2937063  P(node) =0.0089375
##     class counts:    42   101
##    probabilities: 0.294 0.706
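
The improve value reported for the root split in the summary above can be reproduced by hand. The sketch assumes rpart's default Gini criterion for classification trees and uses the class counts printed for nodes 1, 2 and 3. The helper gini_times_n is ours.

```r
# Gini impurity of a node multiplied by its number of observations
gini_times_n <- function(counts) {
  n <- sum(counts)
  n * (1 - sum((counts / n)^2))
}

# Class counts (Not Approved, Approved) taken from the summary output
root  <- c(12176, 3824)   # node 1
left  <- c(10986, 1192)   # node 2
right <- c(1190, 2632)    # node 3

# Reduction in total impurity achieved by splitting on TotalDebtToIncomeRatio
improve <- gini_times_n(root) - gini_times_n(left) - gini_times_n(right)
round(improve, 3)  # compare with improve=2030.5030 in the output
```

The tree picks the split with the largest such reduction, which is why TotalDebtToIncomeRatio appears at the root.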

# Predict classes directly from the decision tree
pred_classes <- predict(model_class_2, newdata = test_data_scaled, type = "class")

# Create confusion matrix for Decision Tree Model
confusion_matrx_2 <- confusionMatrix(pred_classes, test_data_scaled$LoanApproved)
confusion_matrx_2

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Loan Not Approved Loan Approved
##   Loan Not Approved              2891           302
##   Loan Approved                   153           654
##                                            
##                Accuracy : 0.8862           
##                  95% CI : (0.876, 0.8959)  
##     No Information Rate : 0.761            
##     P-Value [Acc > NIR] : < 2.2e-16        
##                                            
##                   Kappa : 0.6696           
##                                            
##  Mcnemar's Test P-Value : 3.967e-12        
##                                            
##             Sensitivity : 0.9497           
##             Specificity : 0.6841           
##          Pos Pred Value : 0.9054           
##          Neg Pred Value : 0.8104           
##              Prevalence : 0.7610           
##          Detection Rate : 0.7228           
##    Detection Prevalence : 0.7983           
##       Balanced Accuracy : 0.8169           
##                                            
##        'Positive' Class : Loan Not Approved
##

Compare Model Performance

# Extract metrics from confusion matrix 1
acc_1 <- confusion_matrx_1$overall["Accuracy"]
prec_1 <- confusion_matrx_1$byClass["Precision"]
rec_1 <- confusion_matrx_1$byClass["Recall"]
f1_1 <- confusion_matrx_1$byClass["F1"]

# Extract metrics from confusion matrix 2
acc_2 <- confusion_matrx_2$overall["Accuracy"]
prec_2 <- confusion_matrx_2$byClass["Precision"]
rec_2 <- confusion_matrx_2$byClass["Recall"]
f1_2 <- confusion_matrx_2$byClass["F1"]

# Combine into a data frame for comparison
model_eval <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall", "F1 Score"),
  Model_1 = c(acc_1, prec_1, rec_1, f1_1),
  Model_2 = c(acc_2, prec_2, rec_2, f1_2)
)

kable(model_eval)

	Metric	Model_1	Model_2
Accuracy	Accuracy	0.9225000	0.8862500
Precision	Precision	0.9423948	0.9054181
Recall	Recall	0.9566360	0.9497372
F1	F1 Score	0.9494620	0.9270483

library(ggplot2)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

melted_eval <- melt(model_eval, id.vars = "Metric")

ggplot(melted_eval, aes(x = Metric, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Model Evaluation Comparison", y = "Score", x = "Metric") +
  theme_minimal() +
  scale_fill_manual(values = c("Model_1" = "steelblue", "Model_2" = "tomato"))

Model Performance Analysis

Accuracy:

Logistic Regression (92.25%) outperforms Decision Tree (88.63%).
It correctly classifies more test instances overall.

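Precision (Positive Predictive Value):

Both confusion matrices treat Loan Not Approved as the positive class, so precision here is the share of predicted rejections that are correct.
Logistic Regression: 94.24% of its positive predictions are correct.
Decision Tree: 90.54% precision.
Logistic Regression is better at avoiding false positives, that is, approved applicants wrongly predicted as not approved.

Recall (Sensitivity / True Positive Rate):

Logistic Regression: 95.66%, slightly higher than Decision Tree: 94.97%.
Both models are strong at identifying actual positives (loans that were not approved), but Logistic Regression is marginally better.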

F1 Score (Harmonic mean of Precision and Recall):

Logistic Regression: 94.95%, higher than Decision Tree: 92.70%.
This indicates that Logistic Regression strikes a better balance between precision and recall.
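
As a check, the F1 values in the comparison table can be recovered from the precision (P) and recall (R) figures reported there, rounded to four decimals:

```latex
F_1 = \frac{2PR}{P + R}
\qquad
F_1^{\text{LR}} = \frac{2(0.9424)(0.9566)}{0.9424 + 0.9566} \approx 0.9495
\qquad
F_1^{\text{DT}} = \frac{2(0.9054)(0.9497)}{0.9054 + 0.9497} \approx 0.9270
```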

Conclusion

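The Logistic Regression model performs better across all key evaluation metrics: Accuracy, Precision, Recall, and F1 Score. It makes more correct predictions overall and better balances identifying positive cases against avoiding false positives. The largest gap is in specificity (81.38% versus 68.41%), which means Logistic Regression recognizes applicants who were actually approved far more reliably than the Decision Tree.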

Regression Modeling

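set.seed(123) # for reproducibility: repeated runs produce the same results
reg_split <- sample.split(data$RiskScore, SplitRatio = 0.8)
# The subsets below are indexed by `split`, the partition created for the
# classification models, so `reg_split` is not used and both tasks share the
# same training and test rows.
reg_train_data <- subset(data, split == TRUE)
reg_test_data <- subset(data, split == FALSE)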
summary(reg_train_data$RiskScore)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.80   46.00   52.00   50.76   56.00   79.00

summary(reg_test_data$RiskScore)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.80   46.00   52.00   50.78   56.00   84.00

# Standardize the numeric features used by the regression models;
# the training and test sets are each scaled with their own means and standard deviations
reg_features_to_scale <- c( "CreditScore", "DebtToIncomeRatio", "BankruptcyHistory", "PreviousLoanDefaults", "TotalAssets", "MonthlyIncome", "NetWorth", "InterestRate", "TotalDebtToIncomeRatio")
reg_train_data_scaled <- reg_train_data
reg_train_data_scaled[reg_features_to_scale] <- scale(reg_train_data[reg_features_to_scale])

reg_test_data_scaled <- reg_test_data
reg_test_data_scaled[reg_features_to_scale] <- scale(reg_test_data[reg_features_to_scale])

Model 3: Linear Regression

We are using Linear Regression to model and predict the RiskScore, a continuous numeric variable that quantifies the financial risk associated with a loan applicant.

Linear Regression is a fundamental regression technique that models the relationship between one dependent variable (in this case, RiskScore) and one or more independent variables (predictors) by fitting a linear equation to the observed data. It assumes a linear relationship between the input features and the target variable.

The goal is to estimate the coefficients that minimize the difference between the actual and predicted values of RiskScore.
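To make the idea of estimating coefficients that minimize the prediction error concrete, the sketch below solves the least-squares problem directly and compares the result with `lm()`. It uses R's built-in `mtcars` data rather than the loan data, so the variable names (`mpg`, `wt`, `hp`) are illustrative only.

```r
# Ordinary least squares by hand on the built-in mtcars data (illustrative only)
X <- cbind(Intercept = 1, wt = mtcars$wt, hp = mtcars$hp)  # design matrix
y <- mtcars$mpg

# Solve the normal equations (X'X) b = X'y; the solution minimizes the
# sum of squared residuals
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() reaches the same estimates (it uses a QR decomposition internally)
fit <- lm(mpg ~ wt + hp, data = mtcars)

print(cbind(normal_equations = drop(beta_hat), lm = coef(fit)))
```

In practice `lm()` is preferable because the QR approach is numerically more stable than inverting X'X.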

Target variable:

RiskScore

Features used in the model:

CreditScore
DebtToIncomeRatio
BankruptcyHistory
PreviousLoanDefaults
TotalAssets
MonthlyIncome
NetWorth
InterestRate
TotalDebtToIncomeRatio

These features were selected based on domain relevance and correlation strength with RiskScore, as observed during exploratory data analysis. The model was trained on 80% of the data and evaluated on the remaining 20% using RMSE, MAE, and R-squared metrics to assess its predictive performance.
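The three evaluation metrics can be written out in a few lines of base R. The helper `regression_metrics()` below is a hypothetical name introduced for illustration, and the toy vectors are made up rather than taken from the loan data.

```r
# Hypothetical helper: RMSE, MAE and R-squared from actual and predicted values
regression_metrics <- function(actual, predicted) {
  residuals <- actual - predicted
  c(
    RMSE = sqrt(mean(residuals^2)),  # penalizes large errors more heavily
    MAE  = mean(abs(residuals)),     # average absolute error, in target units
    R2   = 1 - sum(residuals^2) / sum((actual - mean(actual))^2)
  )
}

# Toy example (made-up numbers)
toy_actual    <- c(50, 52, 46, 56)
toy_predicted <- c(51, 50, 47, 55)
toy_metrics   <- regression_metrics(toy_actual, toy_predicted)
print(round(toy_metrics, 4))
```

An R-squared below zero on a test set means the model predicts worse than always guessing the mean of the actual values.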

# Build the linear regression model
model_reg_1 <- lm(RiskScore ~ CreditScore + DebtToIncomeRatio + BankruptcyHistory +
                     PreviousLoanDefaults + TotalAssets + MonthlyIncome + NetWorth +
                     InterestRate + TotalDebtToIncomeRatio,
                   data = reg_train_data_scaled)

# Summary of the model
summary(model_reg_1)

## 
## Call:
## lm(formula = RiskScore ~ CreditScore + DebtToIncomeRatio + BankruptcyHistory + 
##     PreviousLoanDefaults + TotalAssets + MonthlyIncome + NetWorth + 
##     InterestRate + TotalDebtToIncomeRatio, data = reg_train_data_scaled)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.991  -2.478   0.322   2.735  44.094 
## 
## Coefficients:
##                        Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)            50.76305    0.03302 1537.518   <2e-16 ***
## CreditScore            -0.61732    0.04140  -14.910   <2e-16 ***
## DebtToIncomeRatio       2.49912    0.03303   75.665   <2e-16 ***
## BankruptcyHistory       2.94551    0.03303   89.185   <2e-16 ***
## PreviousLoanDefaults    2.03487    0.03303   61.609   <2e-16 ***
## TotalAssets             0.04880    0.15825    0.308    0.758    
## MonthlyIncome          -3.31265    0.03931  -84.266   <2e-16 ***
## NetWorth               -2.44938    0.15824  -15.478   <2e-16 ***
## InterestRate            1.39543    0.04145   33.666   <2e-16 ***
## TotalDebtToIncomeRatio  0.64448    0.03953   16.305   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.176 on 15990 degrees of freedom
## Multiple R-squared:  0.7124, Adjusted R-squared:  0.7122 
## F-statistic:  4400 on 9 and 15990 DF,  p-value: < 2.2e-16
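
In the summary above, the standard errors for TotalAssets and NetWorth (about 0.158) are several times larger than those of the other standardized predictors (about 0.033 to 0.041), which is a typical sign of collinearity between the two. One way to check is the variance inflation factor (VIF). The sketch below is a minimal base-R version; `vif_manual()` is a hypothetical helper, and it is demonstrated on the built-in `mtcars` data rather than the loan data.

```r
# Hypothetical helper: variance inflation factor for each predictor of a fitted lm
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(v) {
    # Regress predictor v on all remaining predictors
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)                                # VIF_v = 1 / (1 - R_v^2)
  })
}

# Illustration on built-in data; for the loan model the call would be
# vif_manual(model_reg_1)
vif_demo <- vif_manual(lm(mpg ~ wt + hp + disp, data = mtcars))
print(round(vif_demo, 2))
```

A common rule of thumb treats VIF values above 5 or 10 as a reason to drop or combine predictors.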

# Predict on the regression test set
predictions <- predict(model_reg_1, newdata = reg_test_data_scaled)

# View the first few predictions
head(predictions)

##          4          7         11         14         22         27 
## -243879.18  -53523.05 -315357.43 -203074.60  -13974.82  -41265.75
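
A quick range check can flag predictions that fall far outside anything seen in training, such as the values printed above. The helper `check_prediction_range()` is a hypothetical name, and the default tolerance of one training range on either side is an arbitrary assumption.

```r
# Hypothetical helper: TRUE when every prediction lies within a tolerance band
# around the target range observed in training
check_prediction_range <- function(predicted, train_target, slack = 1) {
  rng   <- range(train_target)
  width <- diff(rng)
  all(predicted >= rng[1] - slack * width &
      predicted <= rng[2] + slack * width)
}

# Training RiskScore ranged from 28.8 to 79 in the summary above
plausible_ok <- check_prediction_range(c(45.2, 61.0, 52.3), c(28.8, 79))
recorded_ok  <- check_prediction_range(c(-243879.18, -53523.05), c(28.8, 79))
print(c(plausible = plausible_ok, recorded = recorded_ok))
```

A failed check usually points to a mismatch between the data frame passed to `predict()` and the one used for training.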

# Load required library (install it first if it is missing)
if (!requireNamespace("Metrics", quietly = TRUE)) install.packages("Metrics")

library(Metrics)

## 
## Attaching package: 'Metrics'

## The following object is masked from 'package:pROC':
## 
##     auc

## The following objects are masked from 'package:caret':
## 
##     precision, recall

# Actual vs Predicted
actual <- reg_test_data_scaled$RiskScore
predicted <- predictions

# Calculate performance metrics
mse <- mse(actual, predicted)
rmse <- rmse(actual, predicted)
mae <- mae(actual, predicted)
r_squared <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

# Print the metrics
cat("MSE:", round(mse, 2), "\n")

## MSE: 123504775705

cat("RMSE:", round(rmse, 2), "\n")

## RMSE: 351432.5

cat("MAE:", round(mae, 2), "\n")

## MAE: 178214.2

cat("R-squared:", round(r_squared, 4), "\n")

## R-squared: -2055390643

These test-set figures are not a credible measure of the model. RiskScore ranges from 28.80 to 84.00 in the test set, and the training fit has a residual standard error of 4.176 with an R-squared of 0.7124, yet the recorded predictions are in the tens to hundreds of thousands. Output of this kind most likely arises when `predict()` receives a data frame whose features are not on the training scale, as happens if `test_data_scaled` is passed in place of `reg_test_data_scaled`. The Model Performance Analysis section below recomputes the linear predictions on `reg_test_data_scaled`.

Model 4: Random Forest Regression

We are using Random Forest Regression to model and predict the RiskScore, a continuous numeric variable that quantifies the financial risk associated with a loan applicant.

Random Forest is an ensemble learning algorithm that builds multiple decision trees during training and averages their predictions for regression tasks. This approach helps reduce overfitting and improves generalization performance, especially when the relationship between variables is complex and potentially non-linear.

The model captures interactions and nonlinear patterns in the data by leveraging the “wisdom of the crowd” from multiple trees, making it more robust than a single linear model.
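
The averaging step can be seen directly in the `randomForest` package, which can return every tree's prediction alongside the aggregate. The sketch below uses the built-in `mtcars` data and a deliberately small forest; the object names are illustrative and this is not the loan model.

```r
library(randomForest)

set.seed(123)
# Small forest on built-in data (illustrative only; not the loan data)
rf_demo <- randomForest(mpg ~ wt + hp + disp, data = mtcars, ntree = 25)

# predict.all = TRUE returns each tree's prediction alongside the aggregate
p <- predict(rf_demo, newdata = mtcars, predict.all = TRUE)

dim(p$individual)  # one row per observation, one column per tree

# For regression, the forest's prediction is the mean of the tree predictions
all.equal(unname(rowMeans(p$individual)), unname(p$aggregate))
```

Each tree is grown on a bootstrap sample with a random subset of features tried at each split, so individual trees disagree; averaging them is what reduces the variance.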

Target variable:

RiskScore

Features used in the model:

CreditScore
DebtToIncomeRatio
BankruptcyHistory
PreviousLoanDefaults
TotalAssets
MonthlyIncome
NetWorth
InterestRate
TotalDebtToIncomeRatio

The model was trained using 80% of the data and evaluated on the remaining 20%. Evaluation metrics such as RMSE, MAE, and R-squared were used to assess model accuracy. Additionally, feature importance was analyzed to understand the contribution of each variable to the predictions.

# Install randomForest only if it is not already available
if (!requireNamespace("randomForest", quietly = TRUE)) install.packages("randomForest")

library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

# Build random forest regressor
model_rf <- randomForest(RiskScore ~ CreditScore + DebtToIncomeRatio + BankruptcyHistory +
                           PreviousLoanDefaults + TotalAssets + MonthlyIncome + NetWorth +
                           InterestRate + TotalDebtToIncomeRatio,
                         data = reg_train_data_scaled, ntree = 100)

# Predict
predictions_rf <- predict(model_rf, newdata = reg_test_data_scaled)

# Evaluate
mse_rf <- mse(reg_test_data_scaled$RiskScore, predictions_rf)
rmse_rf <- rmse(reg_test_data_scaled$RiskScore, predictions_rf)
mae_rf <- mae(reg_test_data_scaled$RiskScore, predictions_rf)
r2_rf <- 1 - sum((reg_test_data_scaled$RiskScore - predictions_rf)^2) /
            sum((reg_test_data_scaled$RiskScore - mean(reg_test_data_scaled$RiskScore))^2)

cat("Random Forest Regression Performance:\n")

## Random Forest Regression Performance:

cat("MSE:", round(mse_rf, 2), "\n")

## MSE: 10.94

cat("RMSE:", round(rmse_rf, 2), "\n")

## RMSE: 3.31

cat("MAE:", round(mae_rf, 2), "\n")

## MAE: 2.4

cat("R-squared:", round(r2_rf, 4), "\n")

## R-squared: 0.8179
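
As a cross-check on held-out metrics like those above, a fitted regression forest also stores out-of-bag (OOB) error estimates, computed for each observation from the trees that did not see it during training. The sketch below reads them from a small forest on the built-in `mtcars` data; for the loan model the same components would be read from `model_rf`.

```r
library(randomForest)

set.seed(123)
# Illustrative forest on built-in data (not the loan data)
rf_oob_demo <- randomForest(mpg ~ ., data = mtcars, ntree = 100)

# $mse and $rsq hold the OOB estimates after 1, 2, ..., ntree trees
oob_mse <- tail(rf_oob_demo$mse, 1)
oob_rsq <- tail(rf_oob_demo$rsq, 1)

cat("OOB RMSE:", round(sqrt(oob_mse), 2), "\n")
cat("OOB R-squared:", round(oob_rsq, 4), "\n")
```

If the OOB and test-set figures differ widely, the train/test split or the preprocessing is worth re-examining.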

Model Performance Analysis

# Combine into one data frame for comparison
actual <- reg_test_data_scaled$RiskScore
pred_lm <- predict(model_reg_1, newdata = reg_test_data_scaled)
pred_rf <- predict(model_rf, newdata = reg_test_data_scaled)

viz_data <- data.frame(
  Actual = rep(actual, 2),
  Predicted = c(pred_lm, pred_rf),
  Model = rep(c("Linear Regression", "Random Forest"), each = length(actual))
)

library(ggplot2)
ggplot(viz_data, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(color = "blue", linetype = "dashed") +
  facet_wrap(~ Model) +
  labs(title = "Actual vs. Predicted RiskScore",
       x = "Actual", y = "Predicted") +
  theme_minimal()

Linear Regression (Left plot):

The points are fairly close to the diagonal, but there’s more vertical spread (error).
This suggests the model captures the overall trend but has larger residuals, especially for extreme values (under/over-predicts a bit).

Random Forest (Right plot):

The points are tighter and more aligned along the dashed line.
This means the predictions are more accurate, especially across a wide range of RiskScore values.

The reduced spread indicates smaller prediction errors and better generalization to the test set.

Conclusion: Random Forest performs better than Linear Regression in predicting RiskScore.

library(Metrics)

results <- data.frame(
  Model = c("Linear Regression", "Random Forest"),
  RMSE = c(rmse(actual, pred_lm), rmse(actual, pred_rf)),
  MAE = c(mae(actual, pred_lm), mae(actual, pred_rf)),
  R2 = c(
    1 - sum((actual - pred_lm)^2) / sum((actual - mean(actual))^2),
    1 - sum((actual - pred_rf)^2) / sum((actual - mean(actual))^2)
  )
)

# RMSE bar chart
ggplot(results, aes(x = Model, y = RMSE, fill = Model)) +
  geom_col() +
  labs(title = "RMSE Comparison", y = "RMSE") +
  theme_minimal()

The Random Forest model achieved a lower RMSE, indicating that it makes more accurate predictions of RiskScore on the test set compared to Linear Regression.

# R-squared bar chart
ggplot(results, aes(x = Model, y = R2, fill = Model)) +
  geom_col() +
  labs(title = "R-squared Comparison", y = "R-squared") +
  theme_minimal()

The Random Forest model explains a greater proportion of the variance in RiskScore, suggesting it has stronger predictive power and better fits the data than the Linear Regression model.

Feature Importance (Random Forest)

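# Store the result under a new name so the importance() function is not shadowed.
# Without importance = TRUE at fit time, the only column is IncNodePurity.
rf_importance <- importance(model_rf)
importance_df <- data.frame(Feature = rownames(rf_importance), Importance = rf_importance[, 1])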

# Plot
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Feature Importance (Random Forest)",
       x = "Features", y = "Importance") +
  theme_minimal()

Key Insights:

TotalDebtToIncomeRatio is the most important predictor. Importance scores do not indicate direction, but the positive coefficient in the linear model (0.64448) is consistent with higher debt burdens relative to income raising the risk score.
MonthlyIncome also plays a crucial role, suggesting that income level is strongly predictive of an applicant’s financial stability.
BankruptcyHistory and DebtToIncomeRatio follow closely, which aligns well with real-world expectations: historical financial behavior and current debt load significantly influence credit risk.
Other features like NetWorth, InterestRate, and CreditScore also contribute, but with lesser influence.
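
The ranking above is based on node purity, which is the only importance measure available when the forest is fitted without `importance = TRUE`. Permutation importance is a common complement. The sketch below shows how both measures could be obtained; it uses the built-in `mtcars` data and illustrative object names, not the loan model.

```r
library(randomForest)

set.seed(123)
# importance = TRUE additionally computes permutation importance
rf_imp_demo <- randomForest(mpg ~ wt + hp + disp + qsec, data = mtcars,
                            ntree = 200, importance = TRUE)

# type = 1: %IncMSE, the (scaled) increase in out-of-bag MSE when a feature
# is randomly permuted
imp_perm <- importance(rf_imp_demo, type = 1)

# type = 2: IncNodePurity, the total decrease in node impurity from splits
# on the feature
imp_purity <- importance(rf_imp_demo, type = 2)

print(cbind(imp_perm, imp_purity))
```

When the two measures rank features differently, correlated predictors are a likely cause.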

Discussion:

Based on both the visual comparison and the evaluation metrics, the Random Forest Regression model outperforms the Linear Regression model. It shows a lower RMSE and a higher R² (3.31 and 0.8179 on the test set), indicating improved accuracy and better generalization. The scatter plots of predicted vs. actual values also support this conclusion, as the Random Forest predictions align more closely along the ideal diagonal line.
The Random Forest model identified TotalDebtToIncomeRatio and MonthlyIncome as the most influential features in predicting RiskScore, followed by BankruptcyHistory and DebtToIncomeRatio. This aligns with financial logic: higher debt obligations and lower income levels generally indicate higher financial risk. These insights can inform future loan approval strategies and highlight which applicant attributes require closer scrutiny.

Conclusion

This project demonstrated the power and versatility of the R ecosystem for executing an end-to-end data science workflow. Libraries such as tidyverse and dplyr were pivotal for data preparation, while ggplot2 and corrplot supported data exploration. The modeling phase was handled with caret, rpart, and randomForest, which allowed multiple machine learning models to be developed and evaluated against the project’s objectives.

The analysis identified a preferred model for each predictive task and uncovered the key drivers of financial risk. For binary classification, Logistic Regression proved superior with 92.25% accuracy. In predicting the applicant’s RiskScore, Random Forest Regression was the more effective model, explaining 81.8% of the variance in RiskScore (R² = 0.8179) and highlighting TotalDebtToIncomeRatio and MonthlyIncome as the most influential predictive features. These findings provide a practical, data-driven framework to enhance the accuracy and fairness of future loan approval decisions.

Summary of R Libraries/Packages

The R ecosystem provides a rich set of tools for every data science task. For our loan approval analysis, we utilized the following key packages to handle data pre-processing, create insightful visualizations, and build our predictive models.

Data Manipulation & Visualization

Libraries/Packages	Description
tidyverse	A collection of essential R packages for data science, including `dplyr` and `ggplot2`.
dplyr	Used for data manipulation tasks like `mutate` and `select`.
knitr	Used for creating formatted tables using `kable()`.
reshape2	Used for restructuring data, specifically for creating the model comparison plot.
skimr	Used for generating summary statistics of the data.
caTools	Used for splitting the data into training and testing sets using `sample.split()`.
ggplot2	The primary tool used for creating plots: histograms, boxplots, bar charts, scatter plots.
corrplot	Used specifically to create the correlation heatmap.
rpart.plot	Used to visualize the decision tree model.

Machine Learning & Modeling

Libraries/Packages	Description
caret	A comprehensive framework for model training, feature selection, and evaluation. It was used for `featurePlot`.
rpart	The package used to build the Decision Tree model.
randomForest	The package used to build the Random Forest model.
pROC	Used to generate and analyze the ROC curve and calculate the AUC for the logistic regression model.
Metrics	Used to calculate regression model performance metrics like MSE, RMSE, and MAE.

WQD7004 - Programming for Data Science
