Group 9: Loan Eligibility Prediction and Default Risk
Group Members:
- Mohammad Iqbal Afif bin Mohamad Shahnaz (U2004720)
- Ahmad Marwan Bin Murshidi (17205900)
- Mawaddah Binti Musthafa (17147301)
- Mohamad Ziqry Bin Zulkifli (S2153309)
- Ahmad Salim Odeh Alsane (23083537)
- Determine the details about the dataset (e.g. title, year, purpose of dataset, dimension, content, structure, and the summary) by exploring the raw data.
- Data cleaning: Putting it all together as a new cleaned/processed dataset. Which section of the data do you need to tidy? You can use any cleaning packages in R.
- Data analysis: Identify the questions, what is the objective/goal of processing this dataset? What answers are you interested to find from this dataset? You must have at least two questions. It is best to have different types of problems, i.e. one regression, and one classification.
Introduction: In the past, banks used a manual and often unreliable process to review loan applications. This traditional method made it difficult to accurately predict the risk of default. This project leverages machine learning techniques in R to develop a Loan Prediction system. By automating and enhancing the assessment process, this system aims to provide faster and more accurate evaluations, ultimately helping banks minimize losses from loan defaults.
Objectives:
1. To predict potential customers’ ability to repay their loans based on information obtained from existing customers.
2. To develop a classification model to predict an applicant’s approval for a loan based on their financial and demographic information.
3. To create a regression model to estimate the probability of default for each applicant.
4. To improve the accuracy of loan eligibility predictions to enhance the identification of potential defaulters and minimize banks’ financial risks.
Questions:
1. Classification problem - Which applicants are likely to be eligible for a loan? (Target Variable: Loan.Status)
2. Regression problem - What is the expected default risk for each applicant? (Target Variable: Credit.Score)
Install and Import Libraries
# install.packages("psych")
library(psych)
# install.packages("tidyverse")
library(tidyverse)
# install.packages("scales")
library(scales)
# install.packages("stats")
library(stats)
# install.packages("patchwork")
library(patchwork)
# install.packages("corrplot")
library(corrplot)
# install.packages("car")
library(car)
# install.packages("caret")
library(caret)
# install.packages("smotefamily")
library(smotefamily)
# install.packages('randomForest')
library(randomForest)
# install.packages('xgboost')
library(xgboost)
# install.packages('pROC')
library(pROC)
Loading Data
df_raw <- read.csv("https://github.com/bilecops/Projects/blob/71b0f6e4f715355606e8e2ed3cbb729d16a105ab/credit_train.csv?raw=true")
Data Exploration
- Display the column names
colnames(df_raw)
## [1] "Loan.ID" "Customer.ID"
## [3] "Loan.Status" "Current.Loan.Amount"
## [5] "Term" "Credit.Score"
## [7] "Annual.Income" "Years.in.current.job"
## [9] "Home.Ownership" "Purpose"
## [11] "Monthly.Debt" "Years.of.Credit.History"
## [13] "Months.since.last.delinquent" "Number.of.Open.Accounts"
## [15] "Number.of.Credit.Problems" "Current.Credit.Balance"
## [17] "Maximum.Open.Credit" "Bankruptcies"
## [19] "Tax.Liens"
- Display the data frame
head(df_raw)
## Loan.ID Customer.ID
## 1 14dd8831-6af5-400b-83ec-68e61888a048 981165ec-3274-42f5-a3b4-d104041a9ca9
## 2 4771cc26-131a-45db-b5aa-537ea4ba5342 2de017a3-2e01-49cb-a581-08169e83be29
## 3 4eed4e6a-aa2f-4c91-8651-ce984ee8fb26 5efb2b2b-bf11-4dfd-a572-3761a2694725
## 4 77598f7b-32e7-4e3b-a6e5-06ba0d98fe8a e777faab-98ae-45af-9a86-7ce5b33b1011
## 5 d4062e70-befa-4995-8643-a0de73938182 81536ad9-5ccf-4eb8-befb-47a4d608658e
## 6 89d8cb0c-e5c2-4f54-b056-48a645c543dd 4ffe99d3-7f2a-44db-afc1-40943f1f9750
## Loan.Status Current.Loan.Amount Term Credit.Score Annual.Income
## 1 Fully Paid 445412 Short Term 709 1167493
## 2 Fully Paid 262328 Short Term NA NA
## 3 Fully Paid 99999999 Short Term 741 2231892
## 4 Fully Paid 347666 Long Term 721 806949
## 5 Fully Paid 176220 Short Term NA NA
## 6 Charged Off 206602 Short Term 7290 896857
## Years.in.current.job Home.Ownership Purpose Monthly.Debt
## 1 8 years Home Mortgage Home Improvements 5214.74
## 2 10+ years Home Mortgage Debt Consolidation 33295.98
## 3 8 years Own Home Debt Consolidation 29200.53
## 4 3 years Own Home Debt Consolidation 8741.90
## 5 5 years Rent Debt Consolidation 20639.70
## 6 10+ years Home Mortgage Debt Consolidation 16367.74
## Years.of.Credit.History Months.since.last.delinquent Number.of.Open.Accounts
## 1 17.2 NA 6
## 2 21.1 8 35
## 3 14.9 29 18
## 4 12.0 NA 9
## 5 6.1 NA 15
## 6 17.3 NA 6
## Number.of.Credit.Problems Current.Credit.Balance Maximum.Open.Credit
## 1 1 228190 416746
## 2 0 229976 850784
## 3 1 297996 750090
## 4 0 256329 386958
## 5 0 253460 427174
## 6 0 215308 272448
## Bankruptcies Tax.Liens
## 1 1 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
tail(df_raw)
## Loan.ID Customer.ID Loan.Status Current.Loan.Amount Term Credit.Score
## 100509 NA NA
## 100510 NA NA
## 100511 NA NA
## 100512 NA NA
## 100513 NA NA
## 100514 NA NA
## Annual.Income Years.in.current.job Home.Ownership Purpose Monthly.Debt
## 100509 NA NA
## 100510 NA NA
## 100511 NA NA
## 100512 NA NA
## 100513 NA NA
## 100514 NA NA
## Years.of.Credit.History Months.since.last.delinquent
## 100509 NA NA
## 100510 NA NA
## 100511 NA NA
## 100512 NA NA
## 100513 NA NA
## 100514 NA NA
## Number.of.Open.Accounts Number.of.Credit.Problems Current.Credit.Balance
## 100509 NA NA NA
## 100510 NA NA NA
## 100511 NA NA NA
## 100512 NA NA NA
## 100513 NA NA NA
## 100514 NA NA NA
## Maximum.Open.Credit Bankruptcies Tax.Liens
## 100509 NA NA NA
## 100510 NA NA NA
## 100511 NA NA NA
## 100512 NA NA NA
## 100513 NA NA NA
## 100514 NA NA NA
- Convert missing values recorded as empty or whitespace-only strings into NA values
# Replace empty or whitespace-only strings with NA, column by column
# (trimws() is applied per column; calling it on the whole data frame
# coerces the data frame to a character vector and does not work as intended)
df_raw[] <- lapply(df_raw, function(x) {
  if (is.character(x)) x[trimws(x) == ""] <- NA
  x
})
tail(df_raw)
## Loan.ID Customer.ID Loan.Status Current.Loan.Amount Term Credit.Score
## 100509 <NA> <NA> <NA> NA <NA> NA
## 100510 <NA> <NA> <NA> NA <NA> NA
## 100511 <NA> <NA> <NA> NA <NA> NA
## 100512 <NA> <NA> <NA> NA <NA> NA
## 100513 <NA> <NA> <NA> NA <NA> NA
## 100514 <NA> <NA> <NA> NA <NA> NA
## Annual.Income Years.in.current.job Home.Ownership Purpose Monthly.Debt
## 100509 NA <NA> <NA> <NA> NA
## 100510 NA <NA> <NA> <NA> NA
## 100511 NA <NA> <NA> <NA> NA
## 100512 NA <NA> <NA> <NA> NA
## 100513 NA <NA> <NA> <NA> NA
## 100514 NA <NA> <NA> <NA> NA
## Years.of.Credit.History Months.since.last.delinquent
## 100509 NA NA
## 100510 NA NA
## 100511 NA NA
## 100512 NA NA
## 100513 NA NA
## 100514 NA NA
## Number.of.Open.Accounts Number.of.Credit.Problems Current.Credit.Balance
## 100509 NA NA NA
## 100510 NA NA NA
## 100511 NA NA NA
## 100512 NA NA NA
## 100513 NA NA NA
## 100514 NA NA NA
## Maximum.Open.Credit Bankruptcies Tax.Liens
## 100509 NA NA NA
## 100510 NA NA NA
## 100511 NA NA NA
## 100512 NA NA NA
## 100513 NA NA NA
## 100514 NA NA NA
- Check the number of rows with all NA values across columns
sum(apply(df_raw, 1, function(row) all(is.na(row))))
## [1] 514
- Drop rows that are entirely NA
df1 <- df_raw[rowSums(is.na(df_raw)) != ncol(df_raw), ]
- Check the current number of rows and columns of the data frame
cat("Number of rows: ", nrow(df1), "\n", "Number of columns: ", ncol(df1), sep="")
## Number of rows: 100000
## Number of columns: 19
- Check the data type of each column
lapply(df1, class)
## $Loan.ID
## [1] "character"
##
## $Customer.ID
## [1] "character"
##
## $Loan.Status
## [1] "character"
##
## $Current.Loan.Amount
## [1] "integer"
##
## $Term
## [1] "character"
##
## $Credit.Score
## [1] "integer"
##
## $Annual.Income
## [1] "integer"
##
## $Years.in.current.job
## [1] "character"
##
## $Home.Ownership
## [1] "character"
##
## $Purpose
## [1] "character"
##
## $Monthly.Debt
## [1] "numeric"
##
## $Years.of.Credit.History
## [1] "numeric"
##
## $Months.since.last.delinquent
## [1] "integer"
##
## $Number.of.Open.Accounts
## [1] "integer"
##
## $Number.of.Credit.Problems
## [1] "integer"
##
## $Current.Credit.Balance
## [1] "integer"
##
## $Maximum.Open.Credit
## [1] "integer"
##
## $Bankruptcies
## [1] "integer"
##
## $Tax.Liens
## [1] "integer"
- Describe the data
describe(df1)
## vars n mean sd median
## Loan.ID* 1 100000 41014.91 23696.13 41036.5
## Customer.ID* 2 100000 41016.45 23686.34 41020.5
## Loan.Status* 3 100000 1.77 0.42 2.0
## Current.Loan.Amount 4 100000 11760447.39 31783942.55 312246.0
## Term* 5 100000 1.72 0.45 2.0
## Credit.Score 6 80846 1076.46 1475.40 724.0
## Annual.Income 7 80846 1378276.56 1081360.20 1174162.0
## Years.in.current.job* 8 100000 5.12 3.10 4.0
## Home.Ownership* 9 100000 2.93 0.95 3.0
## Purpose* 10 100000 4.80 2.21 4.0
## Monthly.Debt 11 100000 18472.41 12174.99 16220.3
## Years.of.Credit.History 12 100000 18.20 7.02 16.9
## Months.since.last.delinquent 13 46859 34.90 22.00 32.0
## Number.of.Open.Accounts 14 100000 11.13 5.01 10.0
## Number.of.Credit.Problems 15 100000 0.17 0.48 0.0
## Current.Credit.Balance 16 100000 294637.38 376170.93 209817.0
## Maximum.Open.Credit 17 99998 760798.38 8384503.47 467874.0
## Bankruptcies 18 99796 0.12 0.35 0.0
## Tax.Liens 19 99990 0.03 0.26 0.0
## trimmed mad min max
## Loan.ID* 41019.01 30426.66 1.0 81999.0
## Customer.ID* 41016.79 30410.35 1.0 81999.0
## Loan.Status* 1.84 0.00 1.0 2.0
## Current.Loan.Amount 2191409.03 218926.65 10802.0 99999999.0
## Term* 1.78 0.00 1.0 2.0
## Credit.Score 722.47 25.20 585.0 7510.0
## Annual.Income 1249929.04 564430.27 76627.0 165557393.0
## Years.in.current.job* 4.85 2.97 1.0 12.0
## Home.Ownership* 2.92 1.48 1.0 4.0
## Purpose* 4.25 0.00 1.0 16.0
## Monthly.Debt 17090.61 9969.71 0.0 435843.3
## Years.of.Credit.History 17.53 5.93 3.6 70.5
## Months.since.last.delinquent 33.63 25.20 0.0 176.0
## Number.of.Open.Accounts 10.66 4.45 0.0 76.0
## Number.of.Credit.Problems 0.05 0.00 0.0 15.0
## Current.Credit.Balance 239638.11 170509.38 0.0 32878968.0
## Maximum.Open.Credit 526283.54 342333.82 0.0 1539737892.0
## Bankruptcies 0.01 0.00 0.0 7.0
## Tax.Liens 0.00 0.00 0.0 15.0
## range skew kurtosis se
## Loan.ID* 81998.0 0.00 -1.20 74.93
## Customer.ID* 81998.0 0.00 -1.20 74.90
## Loan.Status* 1.0 -1.31 -0.29 0.00
## Current.Loan.Amount 99989197.0 2.42 3.84 100509.65
## Term* 1.0 -0.99 -1.02 0.00
## Credit.Score 6925.0 3.86 12.97 5.19
## Annual.Income 165480766.0 46.89 6623.59 3803.13
## Years.in.current.job* 11.0 0.72 -0.59 0.01
## Home.Ownership* 3.0 0.12 -1.86 0.00
## Purpose* 15.0 2.21 4.55 0.01
## Monthly.Debt 435843.3 2.21 22.19 38.50
## Years.of.Credit.History 66.9 1.07 1.74 0.02
## Months.since.last.delinquent 176.0 0.43 -0.75 0.10
## Number.of.Open.Accounts 76.0 1.18 3.04 0.02
## Number.of.Credit.Problems 15.0 4.82 48.01 0.00
## Current.Credit.Balance 32878968.0 14.15 697.45 1189.56
## Maximum.Open.Credit 1539737892.0 132.63 20393.41 26514.39
## Bankruptcies 7.0 3.51 18.53 0.00
## Tax.Liens 15.0 15.50 402.04 0.00
- Check the number of unique values for each column
lapply(df1, function(x) length(unique(x)))
## $Loan.ID
## [1] 81999
##
## $Customer.ID
## [1] 81999
##
## $Loan.Status
## [1] 2
##
## $Current.Loan.Amount
## [1] 22004
##
## $Term
## [1] 2
##
## $Credit.Score
## [1] 325
##
## $Annual.Income
## [1] 36175
##
## $Years.in.current.job
## [1] 12
##
## $Home.Ownership
## [1] 4
##
## $Purpose
## [1] 16
##
## $Monthly.Debt
## [1] 65765
##
## $Years.of.Credit.History
## [1] 506
##
## $Months.since.last.delinquent
## [1] 117
##
## $Number.of.Open.Accounts
## [1] 51
##
## $Number.of.Credit.Problems
## [1] 14
##
## $Current.Credit.Balance
## [1] 32730
##
## $Maximum.Open.Credit
## [1] 44597
##
## $Bankruptcies
## [1] 9
##
## $Tax.Liens
## [1] 13
- Inspect the columns with 16 or fewer unique values
columns <- names(Filter(function(x) length(unique(x)) <= 16, df1))
for (col in columns) {
cat(col, ":\n")
cat("Value counts:\n")
print(table(df1[[col]]))
cat("\n", paste(rep("__", 20), collapse = ""), "\n\n")
}
## Loan.Status :
## Value counts:
##
## Charged Off Fully Paid
## 22639 77361
##
## ________________________________________
##
## Term :
## Value counts:
##
## Long Term Short Term
## 27792 72208
##
## ________________________________________
##
## Years.in.current.job :
## Value counts:
##
## < 1 year 1 year 10+ years 2 years 3 years 4 years 5 years 6 years
## 8164 6460 31121 9134 8169 6143 6787 5686
## 7 years 8 years 9 years n/a
## 5577 4582 3955 4222
##
## ________________________________________
##
## Home.Ownership :
## Value counts:
##
## HaveMortgage Home Mortgage Own Home Rent
## 214 48410 9182 42194
##
## ________________________________________
##
## Purpose :
## Value counts:
##
## Business Loan Buy a Car Buy House
## 1569 1265 678
## Debt Consolidation Educational Expenses Home Improvements
## 78552 99 5839
## major_purchase Medical Bills moving
## 352 1127 150
## other Other renewable_energy
## 6037 3250 10
## small_business Take a Trip vacation
## 283 573 101
## wedding
## 115
##
## ________________________________________
##
## Number.of.Credit.Problems :
## Value counts:
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 86035 12077 1299 378 125 49 17 8 4 2 2 2 1
## 15
## 1
##
## ________________________________________
##
## Bankruptcies :
## Value counts:
##
## 0 1 2 3 4 5 6 7
## 88774 10475 417 93 27 7 2 1
##
## ________________________________________
##
## Tax.Liens :
## Value counts:
##
## 0 1 2 3 4 5 6 7 9 10 11 15
## 98062 1343 374 111 58 16 12 7 3 1 2 1
##
## ________________________________________
- Convert the string “n/a” into NA values in the “Years.in.current.job” column
df1$Years.in.current.job[df1$Years.in.current.job == "n/a"] <- NA
- Check the unique values of the “Credit.Score” column
output <- paste(
"Credit.Score:\n",
"Value counts:\n",
paste(capture.output(print(table(df1[["Credit.Score"]]))), collapse = "\n"),
sep = ""
)
cat(output)
## Credit.Score:
## Value counts:
##
## 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600
## 12 7 11 20 6 8 9 4 7 10 19 15 23 9 15 25
## 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616
## 29 16 19 13 23 20 13 19 26 28 29 32 23 33 30 34
## 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632
## 31 32 42 26 41 50 28 52 36 39 43 44 30 35 33 67
## 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648
## 58 45 72 62 47 71 42 88 77 81 87 73 99 75 71 76
## 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664
## 126 98 101 122 118 129 147 171 123 118 148 144 168 137 182 173
## 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680
## 161 167 176 227 230 227 208 219 212 262 262 284 234 298 285 346
## 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696
## 277 294 365 327 369 369 367 338 384 320 386 504 509 506 506 483
## 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712
## 552 595 660 612 620 644 717 716 727 791 814 902 865 857 791 1006
## 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728
## 1017 1046 1070 1061 1218 1261 1118 1216 1465 1387 1421 1522 1548 853 806 931
## 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744
## 925 984 1010 1084 1073 1147 1134 1156 1405 1495 1624 1746 1732 1723 1555 1485
## 745 746 747 748 749 750 751 5850 5860 5890 5900 5920 5930 5940 5950 5960
## 1612 1742 1825 1598 827 1234 723 2 1 1 1 1 2 3 3 1
## 6010 6050 6060 6070 6080 6090 6100 6110 6120 6130 6140 6150 6160 6170 6180 6190
## 1 1 2 1 2 4 2 4 2 3 2 3 6 4 5 2
## 6200 6210 6220 6230 6240 6250 6260 6270 6280 6290 6300 6310 6320 6330 6340 6350
## 2 3 7 6 8 4 7 4 3 5 6 4 7 6 4 6
## 6360 6370 6380 6390 6400 6410 6420 6430 6440 6450 6460 6470 6480 6490 6500 6510
## 4 2 9 7 9 12 6 4 7 13 3 6 6 7 13 9
## 6520 6530 6540 6550 6560 6570 6580 6590 6600 6610 6620 6630 6640 6650 6660 6670
## 15 12 7 8 12 11 13 12 11 13 13 21 9 25 19 17
## 6680 6690 6700 6710 6720 6730 6740 6750 6760 6770 6780 6790 6800 6810 6820 6830
## 17 16 17 14 17 16 24 22 12 16 20 18 25 30 21 24
## 6840 6850 6860 6870 6880 6890 6900 6910 6920 6930 6940 6950 6960 6970 6980 6990
## 35 25 28 27 24 29 25 22 29 36 38 39 38 32 36 46
## 7000 7010 7020 7030 7040 7050 7060 7070 7080 7090 7100 7110 7120 7130 7140 7150
## 35 46 37 39 49 44 45 52 50 58 54 45 51 53 62 45
## 7160 7170 7180 7190 7200 7210 7220 7230 7240 7250 7260 7270 7280 7290 7300 7310
## 68 74 57 68 63 57 59 63 84 75 64 74 77 60 73 73
## 7320 7330 7340 7350 7360 7370 7380 7390 7400 7410 7420 7430 7440 7450 7460 7470
## 83 88 87 76 85 106 93 108 97 87 84 70 58 55 76 51
## 7480 7490 7500 7510
## 43 23 24 9
- Pie chart of target variable “Loan Status”
counts <- table(df1$`Loan.Status`)
percentages <- prop.table(counts) * 100
status <- c("Charged Off", "Fully Paid")
labels <- paste(status, "\n", "Count:", counts, "\n", "(", round(percentages, 1), "%)")
par(pin = c(3, 3))
pie(percentages,
labels = labels,
col = c("red", "green"),
main = 'Percentage distribution of "Loan Status" column',
cex = 1.5)
Data Exploration Report
- Contains 100000 rows and 19 columns (after deleting the 514 rows that were entirely NA)
- 7 categorical columns and 12 numerical columns
- In the “Years.in.current.job” column, all string values of “n/a” were changed into NA values
- “Credit.Score” has anomalies: the maximum is 7510, whereas the range should be around 300 - 850
- “Credit.Score” values above 850 likely contain a data entry error, where one extra trailing zero was appended
- The extra trailing zero should be removed from “Credit.Score” values above 850
- The distribution of the target variable, “Loan.Status”, is imbalanced
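As a quick sanity check of the column-type counts summarised above, the snippet below (a minimal sketch, not part of the original pipeline) tallies numeric versus non-numeric columns in df1.
# Count numerical vs categorical (character) columns
table(sapply(df1, function(x) if (is.numeric(x)) "numerical" else "categorical"))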
Data Cleaning
Duplicates
- Create copy of data frame
df2 <- data.frame(df1)
- Check the number of duplicates and percentage of duplicates
cat("Number of duplicates: ", sum(duplicated(df2)), "\n", "Percentage of duplicates: ", sum(duplicated(df2)) / nrow(df2) * 100, "%", sep="")
## Number of duplicates: 10215
## Percentage of duplicates: 10.215%
- Drop the duplicates
df2 <- df2[!duplicated(df2),]
cat("Number of duplicates: ", sum(duplicated(df2)), "\n", "Percentage of duplicates: ", sum(duplicated(df2)) / nrow(df2) * 100, "%", sep="")
## Number of duplicates: 0
## Percentage of duplicates: 0%
Missing Values
- Checking the missing values for each column
for (col in colnames(df2)) {
if (any(is.na(df2[[col]]))) { # Access column by name
cat("Column: ", "\"", col, "\"", "\n",
"Missing Values Count: ", sum(is.na(df2[[col]])), "\n",
"Percentage: ", round(sum(is.na(df2[[col]])) / nrow(df2) * 100, 2), "%\n",
"_____________________________\n\n", sep="")
}
}
## Column: "Credit.Score"
## Missing Values Count: 19154
## Percentage: 21.33%
## _____________________________
##
## Column: "Annual.Income"
## Missing Values Count: 19154
## Percentage: 21.33%
## _____________________________
##
## Column: "Years.in.current.job"
## Missing Values Count: 3802
## Percentage: 4.23%
## _____________________________
##
## Column: "Months.since.last.delinquent"
## Missing Values Count: 48337
## Percentage: 53.84%
## _____________________________
##
## Column: "Maximum.Open.Credit"
## Missing Values Count: 2
## Percentage: 0%
## _____________________________
##
## Column: "Bankruptcies"
## Missing Values Count: 190
## Percentage: 0.21%
## _____________________________
##
## Column: "Tax.Liens"
## Missing Values Count: 9
## Percentage: 0.01%
## _____________________________
Comments:
- The “Months.since.last.delinquent” column has more than 50% missing values, so the column will be dropped.
- Rows with missing values in “Years.in.current.job”, “Bankruptcies”, “Tax.Liens” or “Maximum.Open.Credit” will be dropped, since their percentages of missing values are relatively low (less than 5%).
- Drop the “Months.since.last.delinquent” column
df2$Months.since.last.delinquent <- NULL
- Drop rows with missing values in “Years.in.current.job”, “Bankruptcies”, “Tax.Liens”, “Maximum.Open.Credit”
df2 <- df2[complete.cases(df2[, c("Years.in.current.job", "Bankruptcies", "Tax.Liens", "Maximum.Open.Credit")]), ]
cat("Number of rows: ", nrow(df2), "\n", "Number of columns: ", ncol(df2), sep="")
## Number of rows: 85791
## Number of columns: 18
- Checking the missing values for each column
for (col in colnames(df2)) {
if (any(is.na(df2[[col]]))) { # Access column by name
cat("Column: ", "\"", col, "\"", "\n",
"Missing Values Count: ", sum(is.na(df2[[col]])), "\n",
"Percentage: ", round(sum(is.na(df2[[col]])) / nrow(df2) * 100, 2), "%\n",
"_____________________________\n\n", sep="")
}
}
## Column: "Credit.Score"
## Missing Values Count: 18301
## Percentage: 21.33%
## _____________________________
##
## Column: "Annual.Income"
## Missing Values Count: 18301
## Percentage: 21.33%
## _____________________________
Handling Anomalies in “Credit.Score”
- Check the distribution of “Credit.Score”
hist(df2$Credit.Score,
main = "Distribution of Credit.Score",
xlab = "Credit.Score",
col = "skyblue",
border = "white")
The range of “Credit.Score” should be around 300 - 850.
Credit scores affected by a data entry error have one extra trailing zero (e.g., 729 recorded as 7290).
The extra trailing zero will be removed from “Credit.Score” values greater than 850.
- Remove the extra trailing zero from “Credit.Score” values greater than 850
df2$Credit.Score <- ifelse(df2$Credit.Score > 850,
df2$Credit.Score / 10,
df2$Credit.Score)
- Check the current distribution of “Credit.Score”
hist(df2$Credit.Score,
main = "Distribution of Credit.Score",
xlab = "Credit.Score",
col = "skyblue",
border = "white")
- Now all the values of “Credit.Score” fall within the expected range of 300 - 850
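As a quick check (a minimal sketch, assuming the correction above has been applied), the corrected score range can be verified directly:
# Verify that no score exceeds 850 after the fix; NAs are still present at this point
range(df2$Credit.Score, na.rm = TRUE)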
Imputation for missing values in “Credit.Score”
- Describe the “Credit.Score” statistics
describe(df2["Credit.Score"])
## vars n mean sd median trimmed mad min max range skew
## Credit.Score 1 67490 718.55 28.18 725 722.81 23.72 585 751 166 -1.45
## kurtosis se
## Credit.Score 2.31 0.11
- Impute missing values in “Credit.Score” using the median value
# Calculate the median of Credit.Score, excluding missing values
median_value <- median(df2$Credit.Score, na.rm = TRUE)
cat('Median for "Credit.Score":', median_value)
## Median for "Credit.Score": 725
# Replace missing values in Credit.Score with the median
df2$Credit.Score[is.na(df2$Credit.Score)] <- median_value
- Check the current distribution of “Credit.Score”
hist(df2$Credit.Score,
main = "Distribution of Credit.Score",
xlab = "Credit.Score",
col = "skyblue",
border = "white")
- Describe the current “Credit.Score” statistics
describe(df2["Credit.Score"])
## vars n mean sd median trimmed mad min max range skew
## Credit.Score 1 85791 719.93 25.14 725 723.84 17.79 585 751 166 -1.77
## kurtosis se
## Credit.Score 3.98 0.09
Imputation for missing values in “Annual.Income”
- Check the distribution of “Annual.Income”
hist(log(df2$Annual.Income),
main = "Distribution of log(Annual.Income)",
xlab = "log(Annual.Income)",
col = "skyblue",
border = "white")
- Describe the “Annual.Income” statistics
describe(df2["Annual.Income"])
## vars n mean sd median trimmed mad min
## Annual.Income 1 67490 1396654 1118572 1198701 1266145 570050.1 76627
## max range skew kurtosis se
## Annual.Income 165557393 165480766 50.36 6924.17 4305.71
- Impute missing values in “Annual.Income” using the median value
# Calculate the median of Annual.Income, excluding missing values
median_value <- median(df2$Annual.Income, na.rm = TRUE)
cat('Median for "Annual.Income":', median_value)
## Median for "Annual.Income": 1198701
# Replace missing values in Annual.Income with the median
df2$Annual.Income[is.na(df2$Annual.Income)] <- median_value
- Check the current distribution of “Annual.Income”
hist(log(df2$Annual.Income),
main = "Distribution of log(Annual.Income)",
xlab = "log(Annual.Income)",
col = "skyblue",
border = "white")
- Describe the current “Annual.Income” statistics
describe(df2["Annual.Income"])
## vars n mean sd median trimmed mad min
## Annual.Income 1 85791 1354427 995424.1 1198701 1238856 397568.8 76627
## max range skew kurtosis se
## Annual.Income 165557393 165480766 56.34 8695.7 3398.5
Mapping Years in current job
- Unique values before the mapping.
unique(df2$Years.in.current.job)
## [1] "8 years" "10+ years" "3 years" "5 years" "< 1 year" "2 years"
## [7] "4 years" "9 years" "7 years" "1 year" "6 years"
job_years_mapping <- list(
"< 1 year" = 1,
"1 year" = 1,
"2 years" = 2,
"3 years" = 3,
"4 years" = 4,
"5 years" = 5,
"6 years" = 6,
"7 years" = 7,
"8 years" = 8,
"9 years" = 9,
"10+ years" = 10
)
df2$`Years.in.current.job` <- sapply(df2$`Years.in.current.job`, function(x) job_years_mapping[[x]])
- Unique values after the mapping.
unique(df2$Years.in.current.job)
## [1] 8 10 3 5 1 2 4 9 7 6
Mapping Purpose
- Unique values before the mapping.
unique(df2$Purpose)
## [1] "Home Improvements" "Debt Consolidation" "Buy House"
## [4] "Business Loan" "Buy a Car" "other"
## [7] "major_purchase" "Take a Trip" "Other"
## [10] "small_business" "Medical Bills" "wedding"
## [13] "vacation" "Educational Expenses" "moving"
## [16] "renewable_energy"
- Since there are so many classes, let’s summarise Purpose by Loan.Status.
crosstab <- table(df2$Purpose, df2$Loan.Status)
crosstab
##
## Charged Off Fully Paid
## Business Loan 465 857
## Buy a Car 189 922
## Buy House 135 426
## Debt Consolidation 16939 50938
## Educational Expenses 14 67
## Home Improvements 1080 3866
## major_purchase 73 239
## Medical Bills 248 655
## moving 38 87
## other 1315 3588
## Other 552 2200
## renewable_energy 4 4
## small_business 112 132
## Take a Trip 104 358
## vacation 26 57
## wedding 21 80
Debt Consolidation should be treated as its own class as it represents a significant portion of the dataset.
- Map the purposes into aggregated categories
# Define the purpose mapping
purpose_mapping <- list(
"Debt Consolidation" = c("Debt Consolidation"),
"Personal Expenses" = c("Medical Bills", "wedding", "Educational Expenses", "moving", "vacation", "Take a Trip"),
"Home and Property" = c("Home Improvements", "Buy House", "renewable_energy"),
"Business and Entrepreneurship" = c("Business Loan", "small_business"),
"Transportation" = c("Buy a Car"),
"Other" = c("major_purchase", "Other", "other")
)
# Reverse the mapping
reversed_purpose_mapping <- unlist(lapply(names(purpose_mapping), function(category) {
setNames(rep(category, length(purpose_mapping[[category]])), purpose_mapping[[category]])
}))
# Map the reversed purpose mapping to the 'Purpose' column
df2$Purpose <- sapply(df2$Purpose, function(x) reversed_purpose_mapping[x])
crosstab <- table(df2$Purpose, df2$Loan.Status)
crosstab
##
## Charged Off Fully Paid
## Business and Entrepreneurship 577 989
## Debt Consolidation 16939 50938
## Home and Property 1219 4296
## Other 1940 6027
## Personal Expenses 451 1304
## Transportation 189 922
All the redundant classes have been categorised and are ready for further analyses.
Mapping Home Ownership
- Unique values before the mapping.
unique(df2$Home.Ownership)
## [1] "Home Mortgage" "Own Home" "Rent" "HaveMortgage"
crosstab <- table(df2$Home.Ownership, df2$Loan.Status)
crosstab
##
## Charged Off Fully Paid
## HaveMortgage 28 147
## Home Mortgage 9353 32396
## Own Home 1891 5614
## Rent 10043 26319
Since “HaveMortgage” has the same meaning as “Home Mortgage” regardless of loan status, these two classes can be grouped together to simplify the analysis without affecting the distribution. “Own Home” is not classified under home mortgage because the home is owned outright, with no property loan to pay in order to maintain ownership of the house.
ownership_mapping <- list(
"Home Mortgage" = c("Home Mortgage", "HaveMortgage"),
"Rent" = c("Rent"),
"Own Home" = c("Own Home")
)
reversed_ownership_mapping <- unlist(lapply(names(ownership_mapping), function(category) {
setNames(rep(category, length(ownership_mapping[[category]])), ownership_mapping[[category]])
}))
df2$Home.Ownership <- sapply(df2$Home.Ownership, function(x) reversed_ownership_mapping[x])
- Summary table of Home Ownership classes by Loan Status after the mapping.
crosstab <- table(df2$Home.Ownership, df2$Loan.Status)
crosstab
##
## Charged Off Fully Paid
## Home Mortgage 9381 32543
## Own Home 1891 5614
## Rent 10043 26319
“HaveMortgage” has been merged with “Home Mortgage”.
Drop Loan ID and Customer ID
# Columns to drop
drop <- c("Customer.ID", "Loan.ID")
# Drop the columns
df2 <- df2[, !(names(df2) %in% drop)]
The cleaned version of the dataframe.
head(df2)
## Loan.Status Current.Loan.Amount Term Credit.Score Annual.Income
## 1 Fully Paid 445412 Short Term 709 1167493
## 2 Fully Paid 262328 Short Term 725 1198701
## 3 Fully Paid 99999999 Short Term 741 2231892
## 4 Fully Paid 347666 Long Term 721 806949
## 5 Fully Paid 176220 Short Term 725 1198701
## 6 Charged Off 206602 Short Term 729 896857
## Years.in.current.job Home.Ownership Purpose Monthly.Debt
## 1 8 Home Mortgage Home and Property 5214.74
## 2 10 Home Mortgage Debt Consolidation 33295.98
## 3 8 Own Home Debt Consolidation 29200.53
## 4 3 Own Home Debt Consolidation 8741.90
## 5 5 Rent Debt Consolidation 20639.70
## 6 10 Home Mortgage Debt Consolidation 16367.74
## Years.of.Credit.History Number.of.Open.Accounts Number.of.Credit.Problems
## 1 17.2 6 1
## 2 21.1 35 0
## 3 14.9 18 1
## 4 12.0 9 0
## 5 6.1 15 0
## 6 17.3 6 0
## Current.Credit.Balance Maximum.Open.Credit Bankruptcies Tax.Liens
## 1 228190 416746 1 0
## 2 229976 850784 0 0
## 3 297996 750090 0 0
## 4 256329 386958 0 0
## 5 253460 427174 0 0
## 6 215308 272448 0 0
- Removed 10215 duplicate rows
- Dropped the “Months since last delinquent” column, as it has more than 50% missing values
- Dropped rows with missing values in “Years in current job”, “Bankruptcies”, “Tax Liens” or “Maximum Open Credit”, since their percentages of missing values are relatively low (less than 5%)
- Removed the extra trailing zero from “Credit Score” values above 850
- Imputed missing values in the “Credit Score” column with the median value
- Imputed missing values in the “Annual Income” column with the median value
- Mapped the “Years in current job” column from strings to integers
- Mapped the “Purpose” column from 16 categories into 6 categories
- Mapped the “Home Ownership” column from 4 categories into 3 categories
- Dropped “Loan ID” and “Customer ID”
Exploratory Data Analysis
- In this section, the relationship between the target variable and the independent features will be explored.
- There are two problems to be explored. For the classification problem, the Loan.Status column is selected as the target variable; for the regression problem, the Credit.Score column is selected as the target variable.
Categorical Features
cat <- cbind(Credit.Score = df2$Credit.Score,df2[, sapply(df2, function(x) !is.numeric(x))])
head(cat)
## Credit.Score Loan.Status Term Home.Ownership Purpose
## 1 709 Fully Paid Short Term Home Mortgage Home and Property
## 2 725 Fully Paid Short Term Home Mortgage Debt Consolidation
## 3 741 Fully Paid Short Term Own Home Debt Consolidation
## 4 721 Fully Paid Long Term Own Home Debt Consolidation
## 5 725 Fully Paid Short Term Rent Debt Consolidation
## 6 729 Charged Off Short Term Home Mortgage Debt Consolidation
- The target variable for each problem and the categorical features are selected for the categorical feature analyses.
Relationship between the target variable and Term
cat$Term <- factor(cat$Term)
cat$Loan.Status <- factor(cat$Loan.Status, levels = c("Fully Paid", "Charged Off"))
order <- cat %>%
count(Term) %>%
arrange(desc(n)) %>%
pull(Term)
cat$Term <- factor(cat$Term, levels = order)
# Loan Status by Term
p1 <- ggplot(cat, aes(x = Term, fill = Loan.Status)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("Fully Paid" = "#66C2A5", "Charged Off" = "#FC8D62")) +
labs(x = "Term", y = "Count", title = "Loan Status by Term") +
theme_minimal() +
theme(
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_x_discrete(limits = order)
# Credit Score by Term
p2 <- ggplot(cat, aes(x = Term, y = Credit.Score)) +
geom_boxplot(aes(fill = Term)) +
labs(x = "Term", y = "Credit Score", title = "Credit Score by Term") +
theme_minimal() +
theme(
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
)
p1 + p2
For short-term loans, the number of customers who fully paid their loan is higher than the number who were charged off. The median credit score of short-term customers is also high, despite a few low outliers. For long-term loans, the distribution between fully paid and charged off is more balanced, but fully paid still outnumbers charged off. The median credit score of long-term customers is still considered good, but lower than that of short-term customers.
Customers tend to pay short-term loans in full more often than long-term loans. They also tend to have higher credit scores, since most of them fully pay their loans compared to long-term customers.
Relationship between the target variable and Home.Ownership
crosstab <- table(cat$Home.Ownership, cat$Loan.Status)
crosstab
##
## Fully Paid Charged Off
## Home Mortgage 32543 9381
## Own Home 5614 1891
## Rent 26319 10043
order_home_ownership <- cat %>%
count(Home.Ownership) %>%
arrange(desc(n)) %>%
pull(Home.Ownership)
cat$Home.Ownership <- factor(cat$Home.Ownership, levels = order_home_ownership)
cat$Loan.Status <- factor(cat$Loan.Status, levels = c("Fully Paid", "Charged Off"))
# Loan Status by Home Ownership
p1 <- ggplot(cat, aes(x = Home.Ownership, fill = Loan.Status)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("Fully Paid" = "#66C2A5", "Charged Off" = "#FC8D62")) +
labs(x = "Home Ownership", y = "Count", title = "Loan Status by Home Ownership") +
theme_minimal() +
theme(
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Credit Score by Home Ownership
p2 <- ggplot(cat, aes(x = Home.Ownership, y = Credit.Score)) +
geom_boxplot(aes(fill = Home.Ownership)) +
labs(x = "Home Ownership", y = "Credit Score", title = "Credit Score by Home Ownership") +
theme_minimal() +
theme(
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
)
p1 + p2
For home mortgage, customers tend to fully pay their loan rather than being charged off. The median credit score of customers in this class is about the same as that of the own home class. The rent class has the highest number of charged-off customers and the lowest median credit score among all classes. The number of fully paid loans is higher than charged off for all classes, but customers in the rent class have more difficulty paying their loans than the other classes, which is reflected in a lower median credit score.
Relationship between the target variable and Purpose
order_purpose <- cat %>%
count(Purpose) %>%
arrange(n) %>%
pull(Purpose)
cat$Purpose <- factor(cat$Purpose, levels = order_purpose)
cat$Loan.Status <- factor(cat$Loan.Status, levels = c("Charged Off", "Fully Paid"))
# Loan Status by Purpose
p1 <- ggplot(cat, aes(y = Purpose, fill = Loan.Status)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("Fully Paid" = "#66C2A5", "Charged Off" = "#FC8D62")) +
labs(x = "Count", y = "Purpose", title = "Loan Status by Purpose") +
theme_minimal() +
theme(
plot.title = element_text(size = 30, hjust = 0.5),
axis.title.x = element_text(size = 22),
axis.title.y = element_text(size = 22),
axis.text = element_text(size = 20),
legend.title = element_text(size = 20),
legend.text = element_text(size = 20)
)
# Credit Score by Purpose
p2 <- ggplot(cat, aes(x = Purpose, y = Credit.Score)) +
geom_boxplot(aes(fill = Purpose)) +
labs(x = "Purpose", y = "Credit Score", title = "Credit Score by Purpose") +
theme_minimal() +
theme(
plot.title = element_text(size = 30, hjust = 0.5),
axis.title.x = element_text(size = 22),
axis.title.y = element_text(size = 22),
axis.text = element_text(size = 20),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.title = element_text(size = 20),
legend.text = element_text(size = 20)
)
p1 / p2
The majority of customers applied for loans for debt consolidation, and most of these loans are “Fully Paid.” However, a significant portion is also “Charged Off.” Loans are mostly fully paid across all classes, with debt consolidation standing out for its much higher loan count.
Customers who applied for transportation loans tend to have the highest credit scores, which indicates they are lower-risk customers. Customers who applied for debt consolidation, home and property, or other purposes have moderate credit scores. Customers who applied for personal expenses or business purposes have lower and more varied credit scores, which indicates they are higher-risk customers.
Chi-square Test
For the classification problem, the chi-square test is used to determine the significance of the variables.
# Encode Loan.Status
cat$Loan.Status <- as.numeric(factor(cat$Loan.Status))
perform_chi_square_test <- function(var1, var2) {
table <- table(var1, var2)
chi_square_result <- chisq.test(table)
result <- data.frame(
Variable1 = deparse(substitute(var1)),
Variable2 = deparse(substitute(var2)),
Chi_Square = chi_square_result$statistic,
DF = chi_square_result$parameter,
P_Value = chi_square_result$p.value
)
return(result)
}
chisquareresults <- bind_rows(
perform_chi_square_test(cat$Loan.Status, cat$Term),
perform_chi_square_test(cat$Loan.Status, cat$Purpose),
perform_chi_square_test(cat$Loan.Status, cat$Home.Ownership)
)
chisquareresults
## Variable1 Variable2 Chi_Square DF P_Value
## X-squared...1 cat$Loan.Status cat$Term 1776.9359 1 0.000000e+00
## X-squared...2 cat$Loan.Status cat$Purpose 181.6589 5 2.366598e-37
## X-squared...3 cat$Loan.Status cat$Home.Ownership 287.2476 2 4.216768e-63
All categorical features for the classification problem are significant since their p-value is less than 0.05.
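As a complement to the p-values, an effect-size measure can indicate how strong these associations actually are. The sketch below is an addition, not part of the original analysis: it computes Cramér’s V for each feature against Loan.Status, and the helper function cramers_v is illustrative.
# Cramér's V: sqrt(chi2 / (n * (k - 1))), where k is the smaller table dimension
cramers_v <- function(var1, var2) {
  tbl <- table(var1, var2)
  chi2 <- suppressWarnings(chisq.test(tbl)$statistic)
  k <- min(nrow(tbl), ncol(tbl))
  unname(sqrt(chi2 / (sum(tbl) * (k - 1))))
}
sapply(cat[, c("Term", "Purpose", "Home.Ownership")],
       function(v) cramers_v(cat$Loan.Status, v))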
Analysis of Variance (ANOVA)
For the regression problem, the ANOVA is used to determine the significance of the variables.
perform_anova <- function(data, category_column, value_column, alpha = 0.05) {
formula <- as.formula(paste(value_column, "~", category_column))
anova_result <- aov(formula, data = data)
summary_result <- summary(anova_result)[[1]]
summary_result <- summary_result[summary_result[[1]] != "Residuals", ]
# F-statistic and p-value
f_statistic <- summary_result[1, "F value"]
p_value <- summary_result[1, "Pr(>F)"]
# F-critical value
dfn <- length(unique(data[[category_column]])) - 1
dfd <- nrow(data) - length(unique(data[[category_column]]))
f_critical <- qf(1 - alpha, dfn, dfd)
# Significance
f_statistic_significance <- ifelse(f_statistic > f_critical, "Significant", "Not significant")
p_value_significance <- ifelse(p_value < alpha, "Significant", "Not significant")
result <- list(
Feature = category_column,
`F-statistic` = round(f_statistic, 6),
`F-critical value` = round(f_critical, 6),
`p-value` = round(p_value, 6),
`F-statistic Significance` = f_statistic_significance,
`p-value Significance` = p_value_significance
)
return(result)
}
anova_results <- bind_rows(
perform_anova(cat, "Term", "Credit.Score"),
perform_anova(cat, "Purpose", "Credit.Score"),
perform_anova(cat, "Home.Ownership", "Credit.Score")
)
anova_results
## # A tibble: 3 × 6
## Feature `F-statistic` `F-critical value` `p-value` F-statistic Signific…¹
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Term 19382. 3.84 0 Significant
## 2 Purpose 180. 2.21 0 Significant
## 3 Home.Owners… 79.6 3.00 0 Significant
## # ℹ abbreviated name: ¹`F-statistic Significance`
## # ℹ 1 more variable: `p-value Significance` <chr>
All categorical features for the regression problem are significant since their p-value is less than 0.05.
Numerical Features
- Numerical features are chosen for the analyses.
num <- df2 %>% select_if(is.numeric)
head(num)
## Current.Loan.Amount Credit.Score Annual.Income Years.in.current.job
## 1 445412 709 1167493 8
## 2 262328 725 1198701 10
## 3 99999999 741 2231892 8
## 4 347666 721 806949 3
## 5 176220 725 1198701 5
## 6 206602 729 896857 10
## Monthly.Debt Years.of.Credit.History Number.of.Open.Accounts
## 1 5214.74 17.2 6
## 2 33295.98 21.1 35
## 3 29200.53 14.9 18
## 4 8741.90 12.0 9
## 5 20639.70 6.1 15
## 6 16367.74 17.3 6
## Number.of.Credit.Problems Current.Credit.Balance Maximum.Open.Credit
## 1 1 228190 416746
## 2 0 229976 850784
## 3 1 297996 750090
## 4 0 256329 386958
## 5 0 253460 427174
## 6 0 215308 272448
## Bankruptcies Tax.Liens
## 1 1 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
Correlation Matrix
corrplot(cor(num),
method = "shade",
mar = c(0, 0, 11, 0),
title = "Correlation Matrix",
cex.main = 6,
tl.cex = 5.3,
cl.cex = 2,
addCoef.col = "black",
number.cex = 3,
diag = F)
- High positive correlation (0.75) between Bankruptcies and Number of Credit Problems.
- Moderately high correlation (0.58) between Tax Liens and Number of Credit Problems.
- Moderate correlation (0.41) between Monthly Debt and Annual Income.
- All other variables have low correlations.
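One common follow-up, shown below as a hedged sketch that was not applied in this project, is to let caret flag features above a correlation cutoff; with cutoff = 0.75 it would typically flag one of the Bankruptcies / Number.of.Credit.Problems pair.
# findCorrelation() suggests features to drop whose pairwise correlation exceeds the cutoff
high_corr <- findCorrelation(cor(num), cutoff = 0.75, names = TRUE)
high_corr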
Relationship between the target variable and Years in Current Job
df2$Years.in.current.job <- factor(df2$Years.in.current.job,
levels = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10'))
# Loan Status by Years in Current Job
p1 <- ggplot(df2, aes(x = Years.in.current.job, fill = `Loan.Status`)) +
geom_bar(position = "dodge") +
scale_x_discrete(labels = c('1 year', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years', '10+ years')) +
scale_fill_manual(values = c("Fully Paid" = "#66C2A5", "Charged Off" = "#FC8D62")) +
labs(x = "Years in Current Job", y = "Count", title = "Loan Status by Years in Current Job") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(size = 20, hjust = 0.5),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
legend.title = element_text(size = 16),
legend.text = element_text(size = 14))
# Credit Score by Years in Current Job
p2 <- ggplot(df2, aes(x = Years.in.current.job, y = Credit.Score)) +
geom_boxplot(aes(fill = Years.in.current.job)) +
scale_x_discrete(labels = c('1 year', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years', '10+ years')) +
labs(x = "Years in Current Job", y = "Credit Score", title = "Credit Score by Years in Current Job") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(size = 20, hjust = 0.5),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
legend.title = element_text(size = 16),
legend.text = element_text(size = 14))
p1 / p2
df2$Years.in.current.job <- as.numeric(df2$Years.in.current.job)
Customers with 10+ years in their current job have the highest count of Fully Paid loans compared to all other groups, which indicates that customers with longer job tenure tend to have better repayment records. Credit scores are stable across all classes.
Outliers Detection
- Find the features where zero value makes the majority proportions of the data.
threshold_percentage <- 50
for (col in colnames(num)) {
most_frequent_value <- names(sort(table(num[[col]]), decreasing = TRUE))[1]
if (most_frequent_value == 0) {
frequency <- max(table(num[[col]]))
percentage <- (frequency / length(num[[col]])) * 100
if (percentage > threshold_percentage) {
cat("Column:", col, "\n")
cat("Unique values:", paste(unique(num[[col]]), collapse = ", "), "\n")
cat("Most frequent value:", most_frequent_value, "\n")
cat("Frequency:", frequency, "\n")
cat("Percentage:", sprintf("%.2f", percentage), "%\n")
cat("\n")
}
}
}
## Column: Number.of.Credit.Problems
## Unique values: 1, 0, 5, 2, 4, 3, 7, 6, 11, 15, 8, 10, 9, 12
## Most frequent value: 0
## Frequency: 74430
## Percentage: 86.76 %
##
## Column: Bankruptcies
## Unique values: 1, 0, 2, 3, 5, 4, 7, 6
## Most frequent value: 0
## Frequency: 76864
## Percentage: 89.59 %
##
## Column: Tax.Liens
## Unique values: 0, 1, 4, 2, 3, 6, 5, 11, 15, 7, 10, 9
## Most frequent value: 0
## Frequency: 84231
## Percentage: 98.18 %
Since most of the data points in these columns are 0 and the remaining values are relatively small integers, an outlier removal process could exclude all values except 0. These features are therefore excluded from the outlier detection.
num_col <- colnames(num)
remove <- c('Number.of.Credit.Problems', 'Bankruptcies', 'Tax.Liens')
num_col <- setdiff(num_col, remove)
num <- num[, num_col]
head(num)
## Current.Loan.Amount Credit.Score Annual.Income Years.in.current.job
## 1 445412 709 1167493 8
## 2 262328 725 1198701 10
## 3 99999999 741 2231892 8
## 4 347666 721 806949 3
## 5 176220 725 1198701 5
## 6 206602 729 896857 10
## Monthly.Debt Years.of.Credit.History Number.of.Open.Accounts
## 1 5214.74 17.2 6
## 2 33295.98 21.1 35
## 3 29200.53 14.9 18
## 4 8741.90 12.0 9
## 5 20639.70 6.1 15
## 6 16367.74 17.3 6
## Current.Credit.Balance Maximum.Open.Credit
## 1 228190 416746
## 2 229976 850784
## 3 297996 750090
## 4 256329 386958
## 5 253460 427174
## 6 215308 272448
- Visualise outliers using boxplot.
num_long <- num %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
# Create the boxplot
ggplot(num_long, aes(x = Feature, y = Value, fill = Feature)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + # Rotate x-axis labels to 90 degrees
labs(title = "Outlier", x = "Feature", y = "Value") +
theme_minimal() +
theme(plot.title = element_text(size = 20, hjust = 0.5),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "none")
Encoding
encode <- function(data, columns) {
for (col in columns) {
# Print unique values before encoding
cat("Before encoding:", "\n")
print(unique(data[[col]]))
# Encoding the categorical column
data[[col]] <- as.integer(factor(data[[col]]))
# Print the encoded unique values
cat("After encoding:", "\n")
print(unique(data[[col]]))
cat("\n")
}
return(data)
}
cat_col <- c('Loan.Status', 'Term', 'Home.Ownership', 'Purpose')
- Encode the categorical features of df2.
df2 <- encode(df2, cat_col)
## Before encoding:
## [1] "Fully Paid" "Charged Off"
## After encoding:
## [1] 2 1
##
## Before encoding:
## [1] "Short Term" "Long Term"
## After encoding:
## [1] 2 1
##
## Before encoding:
## [1] "Home Mortgage" "Own Home" "Rent"
## After encoding:
## [1] 1 2 3
##
## Before encoding:
## [1] "Home and Property" "Debt Consolidation"
## [3] "Business and Entrepreneurship" "Transportation"
## [5] "Other" "Personal Expenses"
## After encoding:
## [1] 3 2 1 6 4 5
Variance Inflation Factor (VIF)
VIF is used to detect multicollinearity among independent variables.
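To make the definition concrete, the sketch below (an illustration added here, not from the original analysis) reproduces the VIF of one predictor by hand: VIF = 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors.
# Manual VIF for Annual.Income in the classification design matrix:
# regress it on the other predictors (everything except the response Loan.Status)
others <- df2[, setdiff(names(df2), "Loan.Status")]
r2 <- summary(lm(Annual.Income ~ ., data = others))$r.squared
1 / (1 - r2)  # should match the vif() value for Annual.Income computed below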
Classification Problem
# VIF for classification problem
lm_classification <- lm(as.formula(paste("Loan.Status ~", paste(setdiff(colnames(df2), "Loan.Status"), collapse = "+"))), data = df2)
vif_classification_results <- vif(lm_classification)
vif_classification <- data.frame(Feature = names(vif_classification_results),
VIF = vif_classification_results) %>%
arrange(desc(VIF))
ggplot(vif_classification, aes(x = reorder(Feature, VIF), y = VIF)) +
geom_bar(stat = "identity", fill = "#66C2A5") +
geom_text(aes(label = round(VIF, 3)), hjust = -0.3, size = 15) +
coord_flip() +
labs(
title = "Variance Inflation Factor (VIF) for Classification Problem Features",
x = "Feature",
y = "VIF"
) +
theme_minimal(base_size = 15) +
theme(plot.title = element_text(hjust = 0.5, size =55),
axis.title.x = element_text(size = 45),
axis.title.y = element_text(size = 45),
axis.text = element_text(size = 40))
The VIFs of all independent features for the classification problem are acceptable, since each is less than 10.
Regression Problem
# VIF for regression problem
lm_regression <- lm(as.formula(paste("Credit.Score ~", paste(setdiff(colnames(df2), "Credit.Score"), collapse = "+"))), data = df2)
vif_regression_results <- vif(lm_regression)
vif_regression <- data.frame(Feature = names(vif_regression_results),
VIF = vif_regression_results) %>%
arrange(desc(VIF))
ggplot(vif_regression, aes(x = reorder(Feature, VIF), y = VIF)) +
geom_bar(stat = "identity", fill = "#FC8D62") +
geom_text(aes(label = round(VIF, 3)), hjust = -0.3, size = 15) +
coord_flip() +
labs(
title = "Variance Inflation Factor (VIF) for Regression Problem Features",
x = "Feature",
y = "VIF"
) +
theme_minimal(base_size = 15) +
theme(plot.title = element_text(hjust = 0.5, size =55),
axis.title.x = element_text(size = 45),
axis.title.y = element_text(size = 45),
axis.text = element_text(size = 40))
The VIFs of all independent features for the regression problem are acceptable, since each is less than 10.
Modelling
Classification Problem
- For the classification problem, loan status is predicted using the Loan.Status target column: “Fully Paid” for eligible customers and “Charged Off” for ineligible customers.
Data Splitting
- The dataset is split into training set (0.8) and test set (0.2) by stratified splitting. The Loan.Status column is defined as the target variable.
set.seed(123)
classet <- createDataPartition(df2$Loan.Status, p=0.8, list=F)
clastrain <- df2[classet, ]
clastest <- df2[-classet, ]
prop.table(table(df2$Loan.Status))
##
## 1 2
## 0.2484526 0.7515474
prop.table(table(clastrain$Loan.Status))
##
## 1 2
## 0.2492678 0.7507322
prop.table(table(clastest$Loan.Status))
##
## 1 2
## 0.2451917 0.7548083
The class distribution of Loan.Status remains similar across all sets.
Resampling using SMOTE
- To address the imbalanced dataset, SMOTE (Synthetic Minority Oversampling Technique) was applied.
classmote <- SMOTE(X = clastrain[, -1],
target = clastrain$Loan.Status,
K = 5,
dup_size = 2)
clastrain1 <- data.frame(classmote$data)
names(clastrain1)[ncol(clastrain1)] <- "Loan.Status"
clastrain1$Loan.Status <- as.factor(clastrain1$Loan.Status)
clastrain1_x <- as.matrix(clastrain1[, -which(names(clastrain1) == "Loan.Status")])
clastrain1_y <- as.numeric(clastrain1$Loan.Status) - 1
clastest_x <- as.matrix(clastest[, -which(names(clastest) == "Loan.Status")])
clastest_y <- as.numeric(clastest$Loan.Status) - 1
prop.table(table(clastrain$Loan.Status))
##
## 1 2
## 0.2492678 0.7507322
prop.table(table(clastrain1$Loan.Status))
##
## 1 2
## 0.4990228 0.5009772
SMOTE synthesises two additional examples for each minority-class observation (dup_size = 2), giving the minority class an approximately equal share with the majority class.
Model Training
- Two models were chosen for the project: Random Forest and XGBoost.
- Random Forest is an ensemble machine learning method based on decision trees and it aggregates the results of multiple decision trees to improve accuracy and reduce overfitting.
rf_classifier <- randomForest(Loan.Status ~ ., data = clastrain1,
ntree = 100,
importance = TRUE)
- XGBoost (Extreme Gradient Boosting) is an optimized implementation of gradient boosting designed to improve performance and speed.
xgb_classifier <- xgboost(
data = clastrain1_x,
label = clastrain1_y,
nrounds = 100,
objective = 'binary:logistic'
)
## [1] train-logloss:0.596056
## [2] train-logloss:0.542586
## [3] train-logloss:0.509624
## [4] train-logloss:0.486869
## [5] train-logloss:0.468468
## [6] train-logloss:0.450323
## [7] train-logloss:0.440073
## [8] train-logloss:0.427220
## [9] train-logloss:0.419883
## [10] train-logloss:0.412799
## [11] train-logloss:0.409033
## [12] train-logloss:0.401297
## [13] train-logloss:0.398308
## [14] train-logloss:0.393384
## [15] train-logloss:0.391144
## [16] train-logloss:0.387098
## [17] train-logloss:0.383878
## [18] train-logloss:0.380269
## [19] train-logloss:0.378728
## [20] train-logloss:0.377309
## [21] train-logloss:0.375065
## [22] train-logloss:0.369018
## [23] train-logloss:0.361935
## [24] train-logloss:0.359263
## [25] train-logloss:0.358279
## [26] train-logloss:0.355269
## [27] train-logloss:0.353848
## [28] train-logloss:0.352044
## [29] train-logloss:0.351285
## [30] train-logloss:0.350476
## [31] train-logloss:0.349817
## [32] train-logloss:0.346267
## [33] train-logloss:0.345514
## [34] train-logloss:0.344774
## [35] train-logloss:0.342828
## [36] train-logloss:0.342304
## [37] train-logloss:0.341778
## [38] train-logloss:0.338915
## [39] train-logloss:0.338116
## [40] train-logloss:0.337389
## [41] train-logloss:0.337218
## [42] train-logloss:0.335160
## [43] train-logloss:0.334314
## [44] train-logloss:0.333590
## [45] train-logloss:0.333170
## [46] train-logloss:0.332305
## [47] train-logloss:0.331696
## [48] train-logloss:0.331025
## [49] train-logloss:0.330768
## [50] train-logloss:0.330379
## [51] train-logloss:0.329106
## [52] train-logloss:0.328854
## [53] train-logloss:0.328327
## [54] train-logloss:0.326051
## [55] train-logloss:0.325576
## [56] train-logloss:0.324394
## [57] train-logloss:0.323411
## [58] train-logloss:0.321972
## [59] train-logloss:0.321335
## [60] train-logloss:0.320433
## [61] train-logloss:0.320057
## [62] train-logloss:0.319495
## [63] train-logloss:0.319166
## [64] train-logloss:0.317997
## [65] train-logloss:0.317380
## [66] train-logloss:0.316968
## [67] train-logloss:0.316569
## [68] train-logloss:0.315874
## [69] train-logloss:0.315299
## [70] train-logloss:0.314597
## [71] train-logloss:0.313909
## [72] train-logloss:0.313707
## [73] train-logloss:0.313235
## [74] train-logloss:0.312517
## [75] train-logloss:0.311922
## [76] train-logloss:0.311067
## [77] train-logloss:0.310525
## [78] train-logloss:0.310114
## [79] train-logloss:0.309645
## [80] train-logloss:0.309160
## [81] train-logloss:0.308734
## [82] train-logloss:0.308627
## [83] train-logloss:0.308426
## [84] train-logloss:0.307926
## [85] train-logloss:0.307294
## [86] train-logloss:0.307170
## [87] train-logloss:0.307060
## [88] train-logloss:0.306673
## [89] train-logloss:0.306314
## [90] train-logloss:0.306202
## [91] train-logloss:0.305896
## [92] train-logloss:0.305518
## [93] train-logloss:0.304962
## [94] train-logloss:0.304567
## [95] train-logloss:0.304233
## [96] train-logloss:0.303952
## [97] train-logloss:0.303403
## [98] train-logloss:0.302593
## [99] train-logloss:0.302280
## [100] train-logloss:0.301792
Test the Model
rf_clas_pred_prob <- predict(rf_classifier, newdata = clastest, type = "prob")
rf_clas_pred <- ifelse(rf_clas_pred_prob[, 2] > 0.5, 1, 0)
xgb_clas_pred_prob <- predict(xgb_classifier, clastest_x)
xgb_clas_pred <- ifelse(xgb_clas_pred_prob > 0.5, 1, 0)
Regression Problem
Data Splitting
- The dataset is split into training set (0.8) and test set (0.2). The Credit.Score column is defined as the target variable.
set.seed(123)
regset <- createDataPartition(df2$Credit.Score, p=0.8, list=F)
regtrain <- df2[regset, ]
regtest <- df2[-regset, ]
regtrain_x <- as.matrix(regtrain[, -which(names(regtrain) == "Credit.Score")])
regtrain_y <- regtrain$Credit.Score
regtest_x <- as.matrix(regtest[, -which(names(regtest) == "Credit.Score")])
regtest_y <- regtest$Credit.Score
Model Training
Similar to the classification problem, Random Forest and XGBoost were also chosen for the regression problem.
Random Forest
rf_reg <- randomForest(Credit.Score ~ ., data=regtrain, ntree = 50)
- XGBoost
xgb_reg <- xgboost(
data = regtrain_x,
label = regtrain_y,
nrounds = 100
)
## [1] train-rmse:504.152902
## [2] train-rmse:353.258614
## [3] train-rmse:247.772815
## [4] train-rmse:174.113107
## [5] train-rmse:122.808234
## [6] train-rmse:87.233804
## [7] train-rmse:62.787166
## [8] train-rmse:46.273409
## [9] train-rmse:35.424976
## [10] train-rmse:28.599793
## [11] train-rmse:24.494535
## [12] train-rmse:22.172118
## [13] train-rmse:20.899739
## [14] train-rmse:20.196642
## [15] train-rmse:19.802898
## [16] train-rmse:19.535091
## [17] train-rmse:19.395052
## [18] train-rmse:19.305943
## [19] train-rmse:19.223774
## [20] train-rmse:19.146398
## [21] train-rmse:19.062564
## [22] train-rmse:19.009610
## [23] train-rmse:18.957081
## [24] train-rmse:18.904423
## [25] train-rmse:18.859871
## [26] train-rmse:18.806744
## [27] train-rmse:18.785699
## [28] train-rmse:18.760868
## [29] train-rmse:18.716777
## [30] train-rmse:18.696336
## [31] train-rmse:18.660398
## [32] train-rmse:18.616775
## [33] train-rmse:18.597696
## [34] train-rmse:18.551940
## [35] train-rmse:18.538307
## [36] train-rmse:18.487860
## [37] train-rmse:18.451868
## [38] train-rmse:18.444705
## [39] train-rmse:18.405782
## [40] train-rmse:18.382120
## [41] train-rmse:18.365880
## [42] train-rmse:18.344140
## [43] train-rmse:18.322081
## [44] train-rmse:18.289515
## [45] train-rmse:18.266673
## [46] train-rmse:18.259950
## [47] train-rmse:18.234204
## [48] train-rmse:18.209596
## [49] train-rmse:18.173723
## [50] train-rmse:18.144786
## [51] train-rmse:18.126122
## [52] train-rmse:18.093391
## [53] train-rmse:18.066887
## [54] train-rmse:18.045779
## [55] train-rmse:18.020931
## [56] train-rmse:17.995193
## [57] train-rmse:17.975349
## [58] train-rmse:17.942384
## [59] train-rmse:17.926082
## [60] train-rmse:17.906469
## [61] train-rmse:17.884972
## [62] train-rmse:17.864283
## [63] train-rmse:17.845357
## [64] train-rmse:17.830017
## [65] train-rmse:17.822264
## [66] train-rmse:17.806078
## [67] train-rmse:17.803143
## [68] train-rmse:17.777942
## [69] train-rmse:17.745179
## [70] train-rmse:17.740964
## [71] train-rmse:17.708373
## [72] train-rmse:17.695856
## [73] train-rmse:17.671558
## [74] train-rmse:17.633395
## [75] train-rmse:17.615272
## [76] train-rmse:17.598380
## [77] train-rmse:17.588348
## [78] train-rmse:17.566782
## [79] train-rmse:17.542397
## [80] train-rmse:17.528464
## [81] train-rmse:17.502755
## [82] train-rmse:17.491953
## [83] train-rmse:17.478933
## [84] train-rmse:17.455682
## [85] train-rmse:17.442524
## [86] train-rmse:17.435676
## [87] train-rmse:17.424583
## [88] train-rmse:17.407363
## [89] train-rmse:17.381464
## [90] train-rmse:17.365696
## [91] train-rmse:17.363838
## [92] train-rmse:17.360838
## [93] train-rmse:17.345311
## [94] train-rmse:17.321054
## [95] train-rmse:17.302149
## [96] train-rmse:17.270130
## [97] train-rmse:17.250148
## [98] train-rmse:17.225718
## [99] train-rmse:17.205127
## [100] train-rmse:17.182926
Test the Model
rf_reg_pred <- predict(rf_reg, newdata = regtest_x)
xgb_reg_pred <- predict(xgb_reg, regtest_x)
Evaluation
Classification Problem
# Random Forest performance
rf_conf_matrix <- confusionMatrix(factor(rf_clas_pred), factor(clastest_y))
rf_accuracy <- rf_conf_matrix$overall["Accuracy"]
rf_precision <- rf_conf_matrix$byClass["Pos Pred Value"]
rf_recall <- rf_conf_matrix$byClass["Sensitivity"]
rf_f1_score <- rf_conf_matrix$byClass["F1"]
roc_rf <- roc(clastest_y, rf_clas_pred_prob[, 2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
rf_auc_roc <- auc(roc_rf)
# XGBoost performance
xgb_conf_matrix <- confusionMatrix(factor(xgb_clas_pred), factor(clastest_y))
xgb_accuracy <- xgb_conf_matrix$overall["Accuracy"]
xgb_precision <- xgb_conf_matrix$byClass["Pos Pred Value"]
xgb_recall <- xgb_conf_matrix$byClass["Sensitivity"]
xgb_f1_score <- xgb_conf_matrix$byClass["F1"]
xgb_roc <- roc(clastest_y, xgb_clas_pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
xgb_auc_roc <- auc(xgb_roc)
classification_results <- data.frame(
Model = c("Random Forest", "XGBoost"),
Accuracy = c(rf_accuracy, xgb_accuracy),
Precision = c(rf_precision, xgb_precision),
Recall = c(rf_recall, xgb_recall),
F1_Score = c(rf_f1_score, xgb_f1_score),
AUC_ROC = c(rf_auc_roc, xgb_auc_roc)
)
classification_results
## Model Accuracy Precision Recall F1_Score AUC_ROC
## 1 Random Forest 0.7458911 0.4567063 0.1918232 0.2701707 0.7037483
## 2 XGBoost 0.7555076 0.5054054 0.1333492 0.2110213 0.6988339
The machine learning models were evaluated based on their performance metrics: Accuracy, Precision, Recall, F1-Score and AUC-ROC. XGBoost has slightly better accuracy and precision than Random Forest, while Random Forest performs slightly better in terms of recall, F1-Score and AUC-ROC.
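For transparency, the sketch below (an added illustration, not part of the original evaluation) recomputes the Random Forest metrics directly from the confusion-matrix counts. Note that confusionMatrix treats the first factor level, 0 (i.e., “Charged Off”), as the positive class, which explains the low recall values above.
# Recompute accuracy, precision, recall and F1 from raw counts
tab <- table(Predicted = rf_clas_pred, Actual = clastest_y)
TP <- tab["0", "0"]; FP <- tab["0", "1"]  # positive class = 0 ("Charged Off")
FN <- tab["1", "0"]; TN <- tab["1", "1"]
c(Accuracy  = (TP + TN) / sum(tab),
  Precision = TP / (TP + FP),
  Recall    = TP / (TP + FN),
  F1        = 2 * TP / (2 * TP + FP + FN))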
Regression Problem
evaluate_regression <- function(predictions, actuals) {
# Using caret
r_squared <- R2(pred = predictions, obs = actuals)
rmse <- RMSE(pred = predictions, obs = actuals)
# Compile results into a data frame
results <- data.frame(R_squared = r_squared, RMSE = rmse)
return(results)
}
rf_reg_results <- evaluate_regression(rf_reg_pred, regtest_y)
xgb_reg_results <- evaluate_regression(xgb_reg_pred, regtest_y)
regression_results <- data.frame(
Model = c("Random Forest Regressor", "XGBoost Regressor"),
R_Squared = c(rf_reg_results$R_squared, xgb_reg_results$R_squared),
RMSE = c(rf_reg_results$RMSE, xgb_reg_results$RMSE)
)
regression_results
## Model R_Squared RMSE
## 1 Random Forest Regressor 0.3828138 19.68817
## 2 XGBoost Regressor 0.3827512 19.68262
For the regression problem, the models were evaluated based on R-Squared and Root Mean Squared Error (RMSE). Both models have nearly identical R-Squared values, with a difference of only 0.0000626. The RMSE values are similarly close, with XGBoost having a slightly lower RMSE.
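As a cross-check, both metrics can be recomputed by hand (a minimal sketch added here); note that caret’s R2() defaults to the squared correlation between predictions and observations, which is what is reproduced below.
# Manual RMSE and R-squared for the XGBoost regressor
c(RMSE      = sqrt(mean((xgb_reg_pred - regtest_y)^2)),
  R_squared = cor(xgb_reg_pred, regtest_y)^2)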
Conclusion
For loan eligibility prediction, both Random Forest and XGBoost were effective. However, the choice between them depends on the problem’s priorities: If accuracy and precision are crucial, XGBoost is slightly better. If recall and overall balance are more critical, Random Forest performs better.
For predicting credit scores, both models demonstrated similar performance, explaining approximately 38% of the variance in the data. The minimal difference in RMSE suggests either model can be used effectively.
This project successfully demonstrated the application of machine learning for both classification and regression tasks in the context of loan eligibility and credit score prediction. The models provide actionable insights that can aid financial institutions in assessing loan risk and creditworthiness. Future improvements could focus on hyperparameter tuning and feature engineering for better performance.
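As a pointer for the suggested future work, a cross-validated grid search could look like the sketch below. This is hedged: the grid values and fold count are illustrative assumptions, and it was not run as part of the project.
# 5-fold cross-validated grid search over Random Forest's mtry via caret::train
ctrl <- trainControl(method = "cv", number = 5)
rf_tuned <- train(Loan.Status ~ ., data = clastrain1,
                  method = "rf",
                  ntree = 100,
                  tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)),
                  trControl = ctrl)
rf_tuned$bestTune  # mtry value with the best cross-validated accuracy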