Group 9: Loan Eligibility Prediction and Default Risk

Group Members:

  1. Determine the details about the dataset (e.g. title, year, purpose of dataset, dimension, content, structure, and the summary) by exploring the raw data.
  2. Data cleaning: Putting it all together as a new cleaned/processed dataset. Which section of the data do you need to tidy? You can use any cleaning packages in R
  3. Data analysis: Identify the questions, what is the objective/goal of processing this dataset? What answers are you interested to find from this dataset? You must have at least two questions. It is best to have different types of problems, i.e. one regression, and one classification.

Introduction: In the past, banks used a manual and often unreliable process to review loan applications. This traditional method raised some struggles to accurately predict the risk of default. This project leverages machine learning techniques in R to develop a Loan Prediction system. By automating and enhancing the assessment process, this system aims to provide faster and more accurate evaluations, ultimately helping banks minimize losses from loan defaults.

Objectives: 1. To predict potential customer’s ability to repay their loan based on the information obtained from existing customers. 2. To develop a classification model to predict the applicant’s approval for a loan based on their financial and demographic information. 3. To create a regression model to estimate the probability of default for each applicant. 4. To improve the accuracy of loan eligibility predictions to enhance the identification of potential defaulters and minimize banks’ financial risks.

Questions: 1. Classification problem - Which applicants are likely to be eligible for a loan? (Target Variable: Loan.Status) 2. Regression problem - What is the expected default risk for each applicant? (Target Variable: Credit.Score)

Install and Import Libraries

# install.packages("psych")
library(psych)
# install.packages("tidyverse")
library(tidyverse)
# install.packages("scales")
library(scales)
# install.packages("stats")
library(stats)
# install.packages("patchwork")
library(patchwork)
# install.packages("corrplot")
library(corrplot)
# install.packages("car")
library(car)
# install.packages("caret")
library(caret)
# install.packages("smotefamily")
library(smotefamily)
# install.packages('randomForest')
library(randomForest)
# install.packages('xgboost')
library(xgboost)
# install.packages('pROC')
library(pROC)

Loading Data

df_raw <- read.csv("https://github.com/bilecops/Projects/blob/71b0f6e4f715355606e8e2ed3cbb729d16a105ab/credit_train.csv?raw=true")

Data Exploration

colnames(df_raw)
##  [1] "Loan.ID"                      "Customer.ID"                 
##  [3] "Loan.Status"                  "Current.Loan.Amount"         
##  [5] "Term"                         "Credit.Score"                
##  [7] "Annual.Income"                "Years.in.current.job"        
##  [9] "Home.Ownership"               "Purpose"                     
## [11] "Monthly.Debt"                 "Years.of.Credit.History"     
## [13] "Months.since.last.delinquent" "Number.of.Open.Accounts"     
## [15] "Number.of.Credit.Problems"    "Current.Credit.Balance"      
## [17] "Maximum.Open.Credit"          "Bankruptcies"                
## [19] "Tax.Liens"
head(df_raw)
##                                Loan.ID                          Customer.ID
## 1 14dd8831-6af5-400b-83ec-68e61888a048 981165ec-3274-42f5-a3b4-d104041a9ca9
## 2 4771cc26-131a-45db-b5aa-537ea4ba5342 2de017a3-2e01-49cb-a581-08169e83be29
## 3 4eed4e6a-aa2f-4c91-8651-ce984ee8fb26 5efb2b2b-bf11-4dfd-a572-3761a2694725
## 4 77598f7b-32e7-4e3b-a6e5-06ba0d98fe8a e777faab-98ae-45af-9a86-7ce5b33b1011
## 5 d4062e70-befa-4995-8643-a0de73938182 81536ad9-5ccf-4eb8-befb-47a4d608658e
## 6 89d8cb0c-e5c2-4f54-b056-48a645c543dd 4ffe99d3-7f2a-44db-afc1-40943f1f9750
##   Loan.Status Current.Loan.Amount       Term Credit.Score Annual.Income
## 1  Fully Paid              445412 Short Term          709       1167493
## 2  Fully Paid              262328 Short Term           NA            NA
## 3  Fully Paid            99999999 Short Term          741       2231892
## 4  Fully Paid              347666  Long Term          721        806949
## 5  Fully Paid              176220 Short Term           NA            NA
## 6 Charged Off              206602 Short Term         7290        896857
##   Years.in.current.job Home.Ownership            Purpose Monthly.Debt
## 1              8 years  Home Mortgage  Home Improvements      5214.74
## 2            10+ years  Home Mortgage Debt Consolidation     33295.98
## 3              8 years       Own Home Debt Consolidation     29200.53
## 4              3 years       Own Home Debt Consolidation      8741.90
## 5              5 years           Rent Debt Consolidation     20639.70
## 6            10+ years  Home Mortgage Debt Consolidation     16367.74
##   Years.of.Credit.History Months.since.last.delinquent Number.of.Open.Accounts
## 1                    17.2                           NA                       6
## 2                    21.1                            8                      35
## 3                    14.9                           29                      18
## 4                    12.0                           NA                       9
## 5                     6.1                           NA                      15
## 6                    17.3                           NA                       6
##   Number.of.Credit.Problems Current.Credit.Balance Maximum.Open.Credit
## 1                         1                 228190              416746
## 2                         0                 229976              850784
## 3                         1                 297996              750090
## 4                         0                 256329              386958
## 5                         0                 253460              427174
## 6                         0                 215308              272448
##   Bankruptcies Tax.Liens
## 1            1         0
## 2            0         0
## 3            0         0
## 4            0         0
## 5            0         0
## 6            0         0
tail(df_raw)
##        Loan.ID Customer.ID Loan.Status Current.Loan.Amount Term Credit.Score
## 100509                                                  NA                NA
## 100510                                                  NA                NA
## 100511                                                  NA                NA
## 100512                                                  NA                NA
## 100513                                                  NA                NA
## 100514                                                  NA                NA
##        Annual.Income Years.in.current.job Home.Ownership Purpose Monthly.Debt
## 100509            NA                                                       NA
## 100510            NA                                                       NA
## 100511            NA                                                       NA
## 100512            NA                                                       NA
## 100513            NA                                                       NA
## 100514            NA                                                       NA
##        Years.of.Credit.History Months.since.last.delinquent
## 100509                      NA                           NA
## 100510                      NA                           NA
## 100511                      NA                           NA
## 100512                      NA                           NA
## 100513                      NA                           NA
## 100514                      NA                           NA
##        Number.of.Open.Accounts Number.of.Credit.Problems Current.Credit.Balance
## 100509                      NA                        NA                     NA
## 100510                      NA                        NA                     NA
## 100511                      NA                        NA                     NA
## 100512                      NA                        NA                     NA
## 100513                      NA                        NA                     NA
## 100514                      NA                        NA                     NA
##        Maximum.Open.Credit Bankruptcies Tax.Liens
## 100509                  NA           NA        NA
## 100510                  NA           NA        NA
## 100511                  NA           NA        NA
## 100512                  NA           NA        NA
## 100513                  NA           NA        NA
## 100514                  NA           NA        NA
df_raw[df_raw == "" | trimws(df_raw) == ""] <- NA
tail(df_raw)
##        Loan.ID Customer.ID Loan.Status Current.Loan.Amount Term Credit.Score
## 100509    <NA>        <NA>        <NA>                  NA <NA>           NA
## 100510    <NA>        <NA>        <NA>                  NA <NA>           NA
## 100511    <NA>        <NA>        <NA>                  NA <NA>           NA
## 100512    <NA>        <NA>        <NA>                  NA <NA>           NA
## 100513    <NA>        <NA>        <NA>                  NA <NA>           NA
## 100514    <NA>        <NA>        <NA>                  NA <NA>           NA
##        Annual.Income Years.in.current.job Home.Ownership Purpose Monthly.Debt
## 100509            NA                 <NA>           <NA>    <NA>           NA
## 100510            NA                 <NA>           <NA>    <NA>           NA
## 100511            NA                 <NA>           <NA>    <NA>           NA
## 100512            NA                 <NA>           <NA>    <NA>           NA
## 100513            NA                 <NA>           <NA>    <NA>           NA
## 100514            NA                 <NA>           <NA>    <NA>           NA
##        Years.of.Credit.History Months.since.last.delinquent
## 100509                      NA                           NA
## 100510                      NA                           NA
## 100511                      NA                           NA
## 100512                      NA                           NA
## 100513                      NA                           NA
## 100514                      NA                           NA
##        Number.of.Open.Accounts Number.of.Credit.Problems Current.Credit.Balance
## 100509                      NA                        NA                     NA
## 100510                      NA                        NA                     NA
## 100511                      NA                        NA                     NA
## 100512                      NA                        NA                     NA
## 100513                      NA                        NA                     NA
## 100514                      NA                        NA                     NA
##        Maximum.Open.Credit Bankruptcies Tax.Liens
## 100509                  NA           NA        NA
## 100510                  NA           NA        NA
## 100511                  NA           NA        NA
## 100512                  NA           NA        NA
## 100513                  NA           NA        NA
## 100514                  NA           NA        NA
sum(apply(df_raw, 1, function(row) all(is.na(row))))
## [1] 514
df1 <- df_raw[rowSums(is.na(df_raw)) != ncol(df_raw), ]
cat("Number of row: ", nrow(df1), "\n", "Number of column: ", ncol(df1), sep="")
## Number of row: 100000
## Number of column: 19
lapply(df1, class)
## $Loan.ID
## [1] "character"
## 
## $Customer.ID
## [1] "character"
## 
## $Loan.Status
## [1] "character"
## 
## $Current.Loan.Amount
## [1] "integer"
## 
## $Term
## [1] "character"
## 
## $Credit.Score
## [1] "integer"
## 
## $Annual.Income
## [1] "integer"
## 
## $Years.in.current.job
## [1] "character"
## 
## $Home.Ownership
## [1] "character"
## 
## $Purpose
## [1] "character"
## 
## $Monthly.Debt
## [1] "numeric"
## 
## $Years.of.Credit.History
## [1] "numeric"
## 
## $Months.since.last.delinquent
## [1] "integer"
## 
## $Number.of.Open.Accounts
## [1] "integer"
## 
## $Number.of.Credit.Problems
## [1] "integer"
## 
## $Current.Credit.Balance
## [1] "integer"
## 
## $Maximum.Open.Credit
## [1] "integer"
## 
## $Bankruptcies
## [1] "integer"
## 
## $Tax.Liens
## [1] "integer"
describe(df1)
##                              vars      n        mean          sd    median
## Loan.ID*                        1 100000    41014.91    23696.13   41036.5
## Customer.ID*                    2 100000    41016.45    23686.34   41020.5
## Loan.Status*                    3 100000        1.77        0.42       2.0
## Current.Loan.Amount             4 100000 11760447.39 31783942.55  312246.0
## Term*                           5 100000        1.72        0.45       2.0
## Credit.Score                    6  80846     1076.46     1475.40     724.0
## Annual.Income                   7  80846  1378276.56  1081360.20 1174162.0
## Years.in.current.job*           8 100000        5.12        3.10       4.0
## Home.Ownership*                 9 100000        2.93        0.95       3.0
## Purpose*                       10 100000        4.80        2.21       4.0
## Monthly.Debt                   11 100000    18472.41    12174.99   16220.3
## Years.of.Credit.History        12 100000       18.20        7.02      16.9
## Months.since.last.delinquent   13  46859       34.90       22.00      32.0
## Number.of.Open.Accounts        14 100000       11.13        5.01      10.0
## Number.of.Credit.Problems      15 100000        0.17        0.48       0.0
## Current.Credit.Balance         16 100000   294637.38   376170.93  209817.0
## Maximum.Open.Credit            17  99998   760798.38  8384503.47  467874.0
## Bankruptcies                   18  99796        0.12        0.35       0.0
## Tax.Liens                      19  99990        0.03        0.26       0.0
##                                 trimmed       mad     min          max
## Loan.ID*                       41019.01  30426.66     1.0      81999.0
## Customer.ID*                   41016.79  30410.35     1.0      81999.0
## Loan.Status*                       1.84      0.00     1.0          2.0
## Current.Loan.Amount          2191409.03 218926.65 10802.0   99999999.0
## Term*                              1.78      0.00     1.0          2.0
## Credit.Score                     722.47     25.20   585.0       7510.0
## Annual.Income                1249929.04 564430.27 76627.0  165557393.0
## Years.in.current.job*              4.85      2.97     1.0         12.0
## Home.Ownership*                    2.92      1.48     1.0          4.0
## Purpose*                           4.25      0.00     1.0         16.0
## Monthly.Debt                   17090.61   9969.71     0.0     435843.3
## Years.of.Credit.History           17.53      5.93     3.6         70.5
## Months.since.last.delinquent      33.63     25.20     0.0        176.0
## Number.of.Open.Accounts           10.66      4.45     0.0         76.0
## Number.of.Credit.Problems          0.05      0.00     0.0         15.0
## Current.Credit.Balance        239638.11 170509.38     0.0   32878968.0
## Maximum.Open.Credit           526283.54 342333.82     0.0 1539737892.0
## Bankruptcies                       0.01      0.00     0.0          7.0
## Tax.Liens                          0.00      0.00     0.0         15.0
##                                     range   skew kurtosis        se
## Loan.ID*                          81998.0   0.00    -1.20     74.93
## Customer.ID*                      81998.0   0.00    -1.20     74.90
## Loan.Status*                          1.0  -1.31    -0.29      0.00
## Current.Loan.Amount            99989197.0   2.42     3.84 100509.65
## Term*                                 1.0  -0.99    -1.02      0.00
## Credit.Score                       6925.0   3.86    12.97      5.19
## Annual.Income                 165480766.0  46.89  6623.59   3803.13
## Years.in.current.job*                11.0   0.72    -0.59      0.01
## Home.Ownership*                       3.0   0.12    -1.86      0.00
## Purpose*                             15.0   2.21     4.55      0.01
## Monthly.Debt                     435843.3   2.21    22.19     38.50
## Years.of.Credit.History              66.9   1.07     1.74      0.02
## Months.since.last.delinquent        176.0   0.43    -0.75      0.10
## Number.of.Open.Accounts              76.0   1.18     3.04      0.02
## Number.of.Credit.Problems            15.0   4.82    48.01      0.00
## Current.Credit.Balance         32878968.0  14.15   697.45   1189.56
## Maximum.Open.Credit          1539737892.0 132.63 20393.41  26514.39
## Bankruptcies                          7.0   3.51    18.53      0.00
## Tax.Liens                            15.0  15.50   402.04      0.00
lapply(df1, function(x) length(unique(x)))
## $Loan.ID
## [1] 81999
## 
## $Customer.ID
## [1] 81999
## 
## $Loan.Status
## [1] 2
## 
## $Current.Loan.Amount
## [1] 22004
## 
## $Term
## [1] 2
## 
## $Credit.Score
## [1] 325
## 
## $Annual.Income
## [1] 36175
## 
## $Years.in.current.job
## [1] 12
## 
## $Home.Ownership
## [1] 4
## 
## $Purpose
## [1] 16
## 
## $Monthly.Debt
## [1] 65765
## 
## $Years.of.Credit.History
## [1] 506
## 
## $Months.since.last.delinquent
## [1] 117
## 
## $Number.of.Open.Accounts
## [1] 51
## 
## $Number.of.Credit.Problems
## [1] 14
## 
## $Current.Credit.Balance
## [1] 32730
## 
## $Maximum.Open.Credit
## [1] 44597
## 
## $Bankruptcies
## [1] 9
## 
## $Tax.Liens
## [1] 13
columns <- names(Filter(function(x) length(unique(x)) <= 16, df1))

for (col in columns) {
  cat(col, ":\n")
  cat("Value counts:\n")
  print(table(df1[[col]]))
  cat("\n", paste(rep("__", 20), collapse = ""), "\n\n")
}
## Loan.Status :
## Value counts:
## 
## Charged Off  Fully Paid 
##       22639       77361 
## 
##  ________________________________________ 
## 
## Term :
## Value counts:
## 
##  Long Term Short Term 
##      27792      72208 
## 
##  ________________________________________ 
## 
## Years.in.current.job :
## Value counts:
## 
##  < 1 year    1 year 10+ years   2 years   3 years   4 years   5 years   6 years 
##      8164      6460     31121      9134      8169      6143      6787      5686 
##   7 years   8 years   9 years       n/a 
##      5577      4582      3955      4222 
## 
##  ________________________________________ 
## 
## Home.Ownership :
## Value counts:
## 
##  HaveMortgage Home Mortgage      Own Home          Rent 
##           214         48410          9182         42194 
## 
##  ________________________________________ 
## 
## Purpose :
## Value counts:
## 
##        Business Loan            Buy a Car            Buy House 
##                 1569                 1265                  678 
##   Debt Consolidation Educational Expenses    Home Improvements 
##                78552                   99                 5839 
##       major_purchase        Medical Bills               moving 
##                  352                 1127                  150 
##                other                Other     renewable_energy 
##                 6037                 3250                   10 
##       small_business          Take a Trip             vacation 
##                  283                  573                  101 
##              wedding 
##                  115 
## 
##  ________________________________________ 
## 
## Number.of.Credit.Problems :
## Value counts:
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
## 86035 12077  1299   378   125    49    17     8     4     2     2     2     1 
##    15 
##     1 
## 
##  ________________________________________ 
## 
## Bankruptcies :
## Value counts:
## 
##     0     1     2     3     4     5     6     7 
## 88774 10475   417    93    27     7     2     1 
## 
##  ________________________________________ 
## 
## Tax.Liens :
## Value counts:
## 
##     0     1     2     3     4     5     6     7     9    10    11    15 
## 98062  1343   374   111    58    16    12     7     3     1     2     1 
## 
##  ________________________________________
df1$Years.in.current.job[df1$Years.in.current.job == "n/a"] <- NA
output <- paste(
  "Credit.Score:\n",
  "Value counts:\n",
  paste(capture.output(print(table(df1[["Credit.Score"]]))), collapse = "\n"),
  sep = ""
)
cat(output)
## Credit.Score:
## Value counts:
## 
##  585  586  587  588  589  590  591  592  593  594  595  596  597  598  599  600 
##   12    7   11   20    6    8    9    4    7   10   19   15   23    9   15   25 
##  601  602  603  604  605  606  607  608  609  610  611  612  613  614  615  616 
##   29   16   19   13   23   20   13   19   26   28   29   32   23   33   30   34 
##  617  618  619  620  621  622  623  624  625  626  627  628  629  630  631  632 
##   31   32   42   26   41   50   28   52   36   39   43   44   30   35   33   67 
##  633  634  635  636  637  638  639  640  641  642  643  644  645  646  647  648 
##   58   45   72   62   47   71   42   88   77   81   87   73   99   75   71   76 
##  649  650  651  652  653  654  655  656  657  658  659  660  661  662  663  664 
##  126   98  101  122  118  129  147  171  123  118  148  144  168  137  182  173 
##  665  666  667  668  669  670  671  672  673  674  675  676  677  678  679  680 
##  161  167  176  227  230  227  208  219  212  262  262  284  234  298  285  346 
##  681  682  683  684  685  686  687  688  689  690  691  692  693  694  695  696 
##  277  294  365  327  369  369  367  338  384  320  386  504  509  506  506  483 
##  697  698  699  700  701  702  703  704  705  706  707  708  709  710  711  712 
##  552  595  660  612  620  644  717  716  727  791  814  902  865  857  791 1006 
##  713  714  715  716  717  718  719  720  721  722  723  724  725  726  727  728 
## 1017 1046 1070 1061 1218 1261 1118 1216 1465 1387 1421 1522 1548  853  806  931 
##  729  730  731  732  733  734  735  736  737  738  739  740  741  742  743  744 
##  925  984 1010 1084 1073 1147 1134 1156 1405 1495 1624 1746 1732 1723 1555 1485 
##  745  746  747  748  749  750  751 5850 5860 5890 5900 5920 5930 5940 5950 5960 
## 1612 1742 1825 1598  827 1234  723    2    1    1    1    1    2    3    3    1 
## 6010 6050 6060 6070 6080 6090 6100 6110 6120 6130 6140 6150 6160 6170 6180 6190 
##    1    1    2    1    2    4    2    4    2    3    2    3    6    4    5    2 
## 6200 6210 6220 6230 6240 6250 6260 6270 6280 6290 6300 6310 6320 6330 6340 6350 
##    2    3    7    6    8    4    7    4    3    5    6    4    7    6    4    6 
## 6360 6370 6380 6390 6400 6410 6420 6430 6440 6450 6460 6470 6480 6490 6500 6510 
##    4    2    9    7    9   12    6    4    7   13    3    6    6    7   13    9 
## 6520 6530 6540 6550 6560 6570 6580 6590 6600 6610 6620 6630 6640 6650 6660 6670 
##   15   12    7    8   12   11   13   12   11   13   13   21    9   25   19   17 
## 6680 6690 6700 6710 6720 6730 6740 6750 6760 6770 6780 6790 6800 6810 6820 6830 
##   17   16   17   14   17   16   24   22   12   16   20   18   25   30   21   24 
## 6840 6850 6860 6870 6880 6890 6900 6910 6920 6930 6940 6950 6960 6970 6980 6990 
##   35   25   28   27   24   29   25   22   29   36   38   39   38   32   36   46 
## 7000 7010 7020 7030 7040 7050 7060 7070 7080 7090 7100 7110 7120 7130 7140 7150 
##   35   46   37   39   49   44   45   52   50   58   54   45   51   53   62   45 
## 7160 7170 7180 7190 7200 7210 7220 7230 7240 7250 7260 7270 7280 7290 7300 7310 
##   68   74   57   68   63   57   59   63   84   75   64   74   77   60   73   73 
## 7320 7330 7340 7350 7360 7370 7380 7390 7400 7410 7420 7430 7440 7450 7460 7470 
##   83   88   87   76   85  106   93  108   97   87   84   70   58   55   76   51 
## 7480 7490 7500 7510 
##   43   23   24    9
counts <- table(df1$`Loan.Status`)
percentages <- prop.table(counts) * 100
status <- c("Charged Off", "Fully Paid")
labels <- paste(status, "\n", "Count:", counts, "\n", "(", round(percentages, 1), "%)")
par(pin = c(3, 3))

pie(percentages, 
    labels = labels, 
    col = c("red", "green"), 
    main = 'Percentage distribution of "Loan Status" column',
    cex = 1.5)

Data Exploration Report

Data Cleaning

Duplicates

df2 <- data.frame(df1)
cat("Number of duplicates: ", sum(duplicated(df2)), "\n", "Percentage of duplicates: ", sum(duplicated(df2)) / nrow(df2) * 100, "%", sep="")
## Number of duplicates: 10215
## Percentage of duplicates: 10.215%
df2 <- df2[!duplicated(df2),]
cat("Number of duplicates: ", sum(duplicated(df2)), "\n", "Percentage of duplicates: ", sum(duplicated(df2)) / nrow(df2) * 100, "%", sep="")
## Number of duplicates: 0
## Percentage of duplicates: 0%

Missing Values

for (col in colnames(df2)) {
  if (any(is.na(df2[[col]]))) {  # Access column by name
    cat("Column: ", "\"", col, "\"", "\n", 
        "Missing Values Count: ", sum(is.na(df2[[col]])), "\n", 
        "Percentage: ", round(sum(is.na(df2[[col]])) / nrow(df2) * 100, 2), "%\n",
        "_____________________________\n\n", sep="")
  }
}
## Column: "Credit.Score"
## Missing Values Count: 19154
## Percentage: 21.33%
## _____________________________
## 
## Column: "Annual.Income"
## Missing Values Count: 19154
## Percentage: 21.33%
## _____________________________
## 
## Column: "Years.in.current.job"
## Missing Values Count: 3802
## Percentage: 4.23%
## _____________________________
## 
## Column: "Months.since.last.delinquent"
## Missing Values Count: 48337
## Percentage: 53.84%
## _____________________________
## 
## Column: "Maximum.Open.Credit"
## Missing Values Count: 2
## Percentage: 0%
## _____________________________
## 
## Column: "Bankruptcies"
## Missing Values Count: 190
## Percentage: 0.21%
## _____________________________
## 
## Column: "Tax.Liens"
## Missing Values Count: 9
## Percentage: 0.01%
## _____________________________

Comments:

df2$Months.since.last.delinquent <- NULL
df2 <- df2[complete.cases(df2[, c("Years.in.current.job", "Bankruptcies", "Tax.Liens", "Maximum.Open.Credit")]), ]
cat("Number of row: ", nrow(df2), "\n", "Number of column: ", ncol(df2), sep="")
## Number of row: 85791
## Number of column: 18
for (col in colnames(df2)) {
  if (any(is.na(df2[[col]]))) {  # Access column by name
    cat("Column: ", "\"", col, "\"", "\n", 
        "Missing Values Count: ", sum(is.na(df2[[col]])), "\n", 
        "Percentage: ", round(sum(is.na(df2[[col]])) / nrow(df2) * 100, 2), "%\n",
        "_____________________________\n\n", sep="")
  }
}
## Column: "Credit.Score"
## Missing Values Count: 18301
## Percentage: 21.33%
## _____________________________
## 
## Column: "Annual.Income"
## Missing Values Count: 18301
## Percentage: 21.33%
## _____________________________

Handling Anomalies in “Credit.Score”

hist(df2$Credit.Score, 
     main = "Distribution of Credit.Score", 
     xlab = "Credit.Score", 
     col = "skyblue", 
     border = "white")

df2$Credit.Score <- ifelse(df2$Credit.Score > 850, 
                           df2$Credit.Score / 10, 
                           df2$Credit.Score)
hist(df2$Credit.Score, 
     main = "Distribution of Credit.Score", 
     xlab = "Credit.Score", 
     col = "skyblue", 
     border = "white")

Imputation for missing values in “Credit.Score”

describe(df2["Credit.Score"])
##              vars     n   mean    sd median trimmed   mad min max range  skew
## Credit.Score    1 67490 718.55 28.18    725  722.81 23.72 585 751   166 -1.45
##              kurtosis   se
## Credit.Score     2.31 0.11
# Calculate the median of Credit.Score, excluding missing values
median_value <- median(df2$Credit.Score, na.rm = TRUE)
cat('Median for "Credit.Score":', median_value)
## Median for "Credit.Score": 725
# Replace missing values in Credit.Score with the median
df2$Credit.Score[is.na(df2$Credit.Score)] <- median_value
hist(df2$Credit.Score, 
     main = "Distribution of Credit.Score", 
     xlab = "Credit.Score", 
     col = "skyblue", 
     border = "white")

describe(df2["Credit.Score"])
##              vars     n   mean    sd median trimmed   mad min max range  skew
## Credit.Score    1 85791 719.93 25.14    725  723.84 17.79 585 751   166 -1.77
##              kurtosis   se
## Credit.Score     3.98 0.09

Imputation for missing values in “Annual.Income”

hist(log(df2$Annual.Income), 
     main = "Distribution of Annual.Income", 
     xlab = "Annual.Income", 
     col = "skyblue", 
     border = "white")

describe(df2["Annual.Income"])
##               vars     n    mean      sd  median trimmed      mad   min
## Annual.Income    1 67490 1396654 1118572 1198701 1266145 570050.1 76627
##                     max     range  skew kurtosis      se
## Annual.Income 165557393 165480766 50.36  6924.17 4305.71
# Calculate the median of Annual.Income, excluding missing values
median_value <- median(df2$Annual.Income, na.rm = TRUE)
cat('Median for "Annual.Income":', median_value)
## Median for "Annual.Income": 1198701
# Replace missing values in Annual.Income with the median
df2$Annual.Income[is.na(df2$Annual.Income)] <- median_value
hist(log(df2$Annual.Income), 
     main = "Distribution of Annual.Income", 
     xlab = "Annual.Income", 
     col = "skyblue", 
     border = "white")

describe(df2["Annual.Income"])
##               vars     n    mean       sd  median trimmed      mad   min
## Annual.Income    1 85791 1354427 995424.1 1198701 1238856 397568.8 76627
##                     max     range  skew kurtosis     se
## Annual.Income 165557393 165480766 56.34   8695.7 3398.5

Mapping Years in current job

unique(df2$Years.in.current.job)
##  [1] "8 years"   "10+ years" "3 years"   "5 years"   "< 1 year"  "2 years"  
##  [7] "4 years"   "9 years"   "7 years"   "1 year"    "6 years"
job_years_mapping <- list(
  "< 1 year" = 1,
  "1 year" = 1,
  "2 years" = 2,
  "3 years" = 3,
  "4 years" = 4,
  "5 years" = 5,
  "6 years" = 6,
  "7 years" = 7,
  "8 years" = 8,
  "9 years" = 9,
  "10+ years" = 10
)
df2$`Years.in.current.job` <- sapply(df2$`Years.in.current.job`, function(x) job_years_mapping[[x]])
unique(df2$Years.in.current.job)
##  [1]  8 10  3  5  1  2  4  9  7  6

Mapping Purpose

unique(df2$Purpose)
##  [1] "Home Improvements"    "Debt Consolidation"   "Buy House"           
##  [4] "Business Loan"        "Buy a Car"            "other"               
##  [7] "major_purchase"       "Take a Trip"          "Other"               
## [10] "small_business"       "Medical Bills"        "wedding"             
## [13] "vacation"             "Educational Expenses" "moving"              
## [16] "renewable_energy"
crosstab <- table(df2$Purpose, df2$Loan.Status)
crosstab
##                       
##                        Charged Off Fully Paid
##   Business Loan                465        857
##   Buy a Car                    189        922
##   Buy House                    135        426
##   Debt Consolidation         16939      50938
##   Educational Expenses          14         67
##   Home Improvements           1080       3866
##   major_purchase                73        239
##   Medical Bills                248        655
##   moving                        38         87
##   other                       1315       3588
##   Other                        552       2200
##   renewable_energy               4          4
##   small_business               112        132
##   Take a Trip                  104        358
##   vacation                      26         57
##   wedding                       21         80

Debt Consolidation should be treated as its own class as it represents a significant portion of the dataset.

# Define the purpose mapping
purpose_mapping <- list(
  "Debt Consolidation" = c("Debt Consolidation"),
  "Personal Expenses" = c("Medical Bills", "wedding", "Educational Expenses", "moving", "vacation", "Take a Trip"),
  "Home and Property" = c("Home Improvements", "Buy House", "renewable_energy"),
  "Business and Entrepreneurship" = c("Business Loan", "small_business"),
  "Transportation" = c("Buy a Car"),
  "Other" = c("major_purchase", "Other", "other")
)

# Reverse the mapping
reversed_purpose_mapping <- unlist(lapply(names(purpose_mapping), function(category) {
  setNames(rep(category, length(purpose_mapping[[category]])), purpose_mapping[[category]])
}))

# Map the reversed purpose mapping to the 'Purpose' column
df2$Purpose <- sapply(df2$Purpose, function(x) reversed_purpose_mapping[x])
crosstab <- table(df2$Purpose, df2$Loan.Status)
crosstab
##                                
##                                 Charged Off Fully Paid
##   Business and Entrepreneurship         577        989
##   Debt Consolidation                  16939      50938
##   Home and Property                    1219       4296
##   Other                                1940       6027
##   Personal Expenses                     451       1304
##   Transportation                        189        922

All the redundant classes have been categorised and are ready for further analyses.

Mapping Home Ownership

unique(df2$Home.Ownership)
## [1] "Home Mortgage" "Own Home"      "Rent"          "HaveMortgage"
crosstab <- table(df2$Home.Ownership, df2$Loan.Status)
crosstab
##                
##                 Charged Off Fully Paid
##   HaveMortgage           28        147
##   Home Mortgage        9353      32396
##   Own Home             1891       5614
##   Rent                10043      26319

Since the HaveMortgage has the same meaning with HomeMortgage without considering loan status, it is possible to group this two classes together to simplify the analysis without affecting the distribution. Own home is not classified under the home mortgage because the home is 100% totally own by the owner without having to pay any property loan to maintain the ownership of the house.

ownership_mapping <- list(
  "Home Mortgage" = c("Home Mortgage", "HaveMortgage"),
  "Rent" = c("Rent"),
  "Own Home" = c("Own Home")
)
reversed_ownership_mapping <- unlist(lapply(names(ownership_mapping), function(category) {
  setNames(rep(category, length(ownership_mapping[[category]])), ownership_mapping[[category]])
}))
df2$Home.Ownership <- sapply(df2$Home.Ownership, function(x) reversed_ownership_mapping[x])
crosstab <- table(df2$Home.Ownership, df2$Loan.Status)
crosstab
##                
##                 Charged Off Fully Paid
##   Home Mortgage        9381      32543
##   Own Home             1891       5614
##   Rent                10043      26319

“HaveMortgage” have been merged with “Home Mortgage”.

Drop Loan ID and Customer ID

# Columns to drop
drop <- c("Customer.ID", "Loan.ID")
# Drop the columns
df2 <- df2[, !(names(df2) %in% drop)]

df2

The cleaned version of the dataframe.

head(df2)
##   Loan.Status Current.Loan.Amount       Term Credit.Score Annual.Income
## 1  Fully Paid              445412 Short Term          709       1167493
## 2  Fully Paid              262328 Short Term          725       1198701
## 3  Fully Paid            99999999 Short Term          741       2231892
## 4  Fully Paid              347666  Long Term          721        806949
## 5  Fully Paid              176220 Short Term          725       1198701
## 6 Charged Off              206602 Short Term          729        896857
##   Years.in.current.job Home.Ownership            Purpose Monthly.Debt
## 1                    8  Home Mortgage  Home and Property      5214.74
## 2                   10  Home Mortgage Debt Consolidation     33295.98
## 3                    8       Own Home Debt Consolidation     29200.53
## 4                    3       Own Home Debt Consolidation      8741.90
## 5                    5           Rent Debt Consolidation     20639.70
## 6                   10  Home Mortgage Debt Consolidation     16367.74
##   Years.of.Credit.History Number.of.Open.Accounts Number.of.Credit.Problems
## 1                    17.2                       6                         1
## 2                    21.1                      35                         0
## 3                    14.9                      18                         1
## 4                    12.0                       9                         0
## 5                     6.1                      15                         0
## 6                    17.3                       6                         0
##   Current.Credit.Balance Maximum.Open.Credit Bankruptcies Tax.Liens
## 1                 228190              416746            1         0
## 2                 229976              850784            0         0
## 3                 297996              750090            0         0
## 4                 256329              386958            0         0
## 5                 253460              427174            0         0
## 6                 215308              272448            0         0

Exploratory Data Analysis

Categorical Features

cat <- cbind(Credit.Score = df2$Credit.Score,df2[, sapply(df2, function(x) !is.numeric(x))])
head(cat)
##   Credit.Score Loan.Status       Term Home.Ownership            Purpose
## 1          709  Fully Paid Short Term  Home Mortgage  Home and Property
## 2          725  Fully Paid Short Term  Home Mortgage Debt Consolidation
## 3          741  Fully Paid Short Term       Own Home Debt Consolidation
## 4          721  Fully Paid  Long Term       Own Home Debt Consolidation
## 5          725  Fully Paid Short Term           Rent Debt Consolidation
## 6          729 Charged Off Short Term  Home Mortgage Debt Consolidation

Relationship between the target variable and Term

cat$Term <- factor(cat$Term)
cat$Loan.Status <- factor(cat$Loan.Status, levels = c("Fully Paid", "Charged Off"))

order <- cat %>%
  count(Term) %>%
  arrange(desc(n)) %>%
  pull(Term)

cat$Term <- factor(cat$Term, levels = order)

# Loan Status by Term
p1 <- ggplot(cat, aes(x = Term, fill = Loan.Status)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("Fully Paid" = "#66C2A5", "Charged Off" = "#FC8D62")) +
  labs(x = "Term", y = "Count", title = "Loan Status by Term") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  scale_x_discrete(limits = order)

# Credit Score by Term
p2 <- ggplot(cat, aes(x = Term, y = Credit.Score)) +
  geom_boxplot(aes(fill = Term)) +
  labs(x = "Term", y = "Credit Score", title = "Credit Score by Term") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

p1 + p2

For short term, the number of customers who are able to fully paid their loan are higher than the charged off customers. The median credit score of the short term customers are also high even though there are few lower outliers. For long term, the distribution between fully paid and charged of loan is more balanced but the number of fully paid are still higher than charged off. The median credit score of the long term customer is also still considered good but lower compared to the median of short term credit score.

Customers tend to pay in full for short term loan compared to long term loan. They also tend to have a higher credit score because most of them are able to fully paid their loans compared to the long term customer.

Relationship between the target variable and Home.Ownership

crosstab <- table(cat$Home.Ownership, cat$Loan.Status)
crosstab
##                
##                 Fully Paid Charged Off
##   Home Mortgage      32543        9381
##   Own Home            5614        1891
##   Rent               26319       10043
order_home_ownership <- cat %>%
  count(Home.Ownership) %>%
  arrange(desc(n)) %>%
  pull(Home.Ownership)

cat$Home.Ownership <- factor(cat$Home.Ownership, levels = order_home_ownership)
cat$Loan.Status <- factor(cat$Loan.Status, levels = c("Fully Paid", "Charged Off"))

# Loan Status by Home Ownership
p1 <- ggplot(cat, aes(x = Home.Ownership, fill = Loan.Status)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("Fully Paid" = "#66C2A5", "Charged Off" = "#FC8D62")) +
  labs(x = "Home Ownership", y = "Count", title = "Loan Status by Home Ownership") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Credit Score by Home Ownership
p2 <- ggplot(cat, aes(x = Home.Ownership, y = Credit.Score)) +
  geom_boxplot(aes(fill = Home.Ownership)) +
  labs(x = "Home Ownership", y = "Credit Score", title = "Credit Score by Home Ownership") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

p1 + p2

For home mortgage, customers tend to fully paid their loan compared to being charged off. The median credit score of customers in this class can be considered the same as the median credit score of the own home class. For rent, this class has the highest numbers of charged off customers and the lowest median credit score among all classes. The number of fully paid loans are higher than charged off for all classes while customers in rent class have a higher difficulty in paying their loans compared to other classes which results in lower median credit score.

Relationship between the target variable and Purpose

order_purpose <- cat %>%
  count(Purpose) %>%
  arrange(n) %>%
  pull(Purpose)

cat$Purpose <- factor(cat$Purpose, levels = order_purpose)
cat$Loan.Status <- factor(cat$Loan.Status, levels = c("Charged Off", "Fully Paid"))

# Loan Status by Purpose
p1 <- ggplot(cat, aes(y = Purpose, fill = Loan.Status)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("Fully Paid" = "#66C2A5", "Charged Off" = "#FC8D62")) +
  labs(x = "Count", y = "Purpose", title = "Loan Status by Purpose") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 30, hjust = 0.5),
    axis.title.x = element_text(size = 22),
    axis.title.y = element_text(size = 22),
    axis.text = element_text(size = 20),
    legend.title = element_text(size = 20),
    legend.text = element_text(size = 20)
  )

# Credit Score by Purpose
p2 <- ggplot(cat, aes(x = Purpose, y = Credit.Score)) +
  geom_boxplot(aes(fill = Purpose)) +
  labs(x = "Purpose", y = "Credit Score", title = "Credit Score by Purpose") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 30, hjust = 0.5),
    axis.title.x = element_text(size = 22),
    axis.title.y = element_text(size = 22),
    axis.text = element_text(size = 20),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.title = element_text(size = 20),
    legend.text = element_text(size = 20)
  )

p1 / p2

The majority of customers who applied loans for debt consolidation purposes, and most of these loan status are “Fully Paid.” However, a significant portion is also “Charged Off.” The majority of loans are fully paid for all classes, with debt consolidation being the exception with a much higher loan count.

Customers who applied loans for transportation tend to have the highest credit score which indicates they are lower-risk customers. The customers who applied loans for debt consolidation, home and property or the other purposes have moderate credit scores. The customers who applied loans for personal expenses or business purpose have lower and more varied credit scores which indicates that they are higher-risks customers.

Chi-square Test

For the classification problem, the chi-square test is used to determine the significance of the variables.

# Encode Loan.Status
cat$Loan.Status <- as.numeric(factor(cat$Loan.Status))

perform_chi_square_test <- function(var1, var2) {
  table <- table(var1, var2)
  chi_square_result <- chisq.test(table)
  
  result <- data.frame(
    Variable1 = deparse(substitute(var1)),
    Variable2 = deparse(substitute(var2)),
    Chi_Square = chi_square_result$statistic,
    DF = chi_square_result$parameter,
    P_Value = chi_square_result$p.value
  )
  
  return(result)
}
chisquareresults <- bind_rows(
  perform_chi_square_test(cat$Loan.Status, cat$Term),
  perform_chi_square_test(cat$Loan.Status, cat$Purpose),
  perform_chi_square_test(cat$Loan.Status, cat$Home.Ownership)
)

chisquareresults
##                     Variable1          Variable2 Chi_Square DF      P_Value
## X-squared...1 cat$Loan.Status           cat$Term  1776.9359  1 0.000000e+00
## X-squared...2 cat$Loan.Status        cat$Purpose   181.6589  5 2.366598e-37
## X-squared...3 cat$Loan.Status cat$Home.Ownership   287.2476  2 4.216768e-63

All categorical features for the classification problem are significant since their p-value is less than 0.05.

Analysis of Variance (ANOVA)

For the regression problem, the ANOVA is used to determine the significance of the variables.

perform_anova <- function(data, category_column, value_column, alpha = 0.05) {

  formula <- as.formula(paste(value_column, "~", category_column))
  
  anova_result <- aov(formula, data = data)

  summary_result <- summary(anova_result)[[1]]

  summary_result <- summary_result[summary_result[[1]] != "Residuals", ]
  
  # F-statistic and p-value
  f_statistic <- summary_result[1, "F value"]
  p_value <- summary_result[1, "Pr(>F)"]
  
  # F-critical value
  dfn <- length(unique(data[[category_column]])) - 1  
  dfd <- nrow(data) - length(unique(data[[category_column]]))  
  f_critical <- qf(1 - alpha, dfn, dfd)
  
  # Signifincance
  f_statistic_significance <- ifelse(f_statistic > f_critical, "Significant", "Not significant")
  p_value_significance <- ifelse(p_value < alpha, "Significant", "Not significant")
  
  result <- list(
    Feature = category_column,
    `F-statistic` = round(f_statistic, 6),
    `F-critical value` = round(f_critical, 6),
    `p-value` = round(p_value, 6),
    `F-statistic Significance` = f_statistic_significance,
    `p-value Significance` = p_value_significance
  )
  
  return(result)
}
anova_results <- bind_rows(
  perform_anova(cat, "Term", "Credit.Score"),
  perform_anova(cat, "Purpose", "Credit.Score"),
  perform_anova(cat, "Home.Ownership", "Credit.Score")
)

anova_results
## # A tibble: 3 × 6
##   Feature      `F-statistic` `F-critical value` `p-value` F-statistic Signific…¹
##   <chr>                <dbl>              <dbl>     <dbl> <chr>                 
## 1 Term               19382.                3.84         0 Significant           
## 2 Purpose              180.                2.21         0 Significant           
## 3 Home.Owners…          79.6               3.00         0 Significant           
## # ℹ abbreviated name: ¹​`F-statistic Significance`
## # ℹ 1 more variable: `p-value Significance` <chr>

All categorical features for the regression problem are significant since their p-value is less than 0.05.

Numerical Features

num <- df2 %>% select_if(is.numeric)
head(num)
##   Current.Loan.Amount Credit.Score Annual.Income Years.in.current.job
## 1              445412          709       1167493                    8
## 2              262328          725       1198701                   10
## 3            99999999          741       2231892                    8
## 4              347666          721        806949                    3
## 5              176220          725       1198701                    5
## 6              206602          729        896857                   10
##   Monthly.Debt Years.of.Credit.History Number.of.Open.Accounts
## 1      5214.74                    17.2                       6
## 2     33295.98                    21.1                      35
## 3     29200.53                    14.9                      18
## 4      8741.90                    12.0                       9
## 5     20639.70                     6.1                      15
## 6     16367.74                    17.3                       6
##   Number.of.Credit.Problems Current.Credit.Balance Maximum.Open.Credit
## 1                         1                 228190              416746
## 2                         0                 229976              850784
## 3                         1                 297996              750090
## 4                         0                 256329              386958
## 5                         0                 253460              427174
## 6                         0                 215308              272448
##   Bankruptcies Tax.Liens
## 1            1         0
## 2            0         0
## 3            0         0
## 4            0         0
## 5            0         0
## 6            0         0

Correlation Matrix

corrplot(cor(num), 
         method = "shade",           
         mar = c(0, 0, 11, 0),
         title = "Correlation Matrix",
         cex.main = 6,
         tl.cex = 5.3,                  
         cl.cex = 2,                  
         addCoef.col = "black",
         number.cex = 3,            
         diag = F)

Relationship between the target variable and Years in Current Job

df2$Years.in.current.job <- factor(df2$Years.in.current.job, 
                                   levels = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10'))

# Loan Status by Years in Current Job
p1 <- ggplot(df2, aes(x = Years.in.current.job, fill = `Loan.Status`)) +
  geom_bar(position = "dodge") +
  scale_x_discrete(labels = c('1 year', '2 years', '3 years', '4 years', '5 years', 
                              '6 years', '7 years', '8 years', '9 years', '10+ years')) +
  scale_fill_manual(values = c("Fully Paid" = "#66C2A5", "Charged Off" = "#FC8D62")) + 
  labs(x = "Years in Current Job", y = "Count", title = "Loan Status by Years in Current Job") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        plot.title = element_text(size = 20, hjust = 0.5),
        axis.title.x = element_text(size = 18),
        axis.title.y = element_text(size = 18),
        legend.title = element_text(size = 16),               
        legend.text = element_text(size = 14))                

# Credit Score by Years in Current Job
p2 <- ggplot(df2, aes(x = Years.in.current.job, y = Credit.Score)) +
  geom_boxplot(aes(fill = Years.in.current.job)) +
  scale_x_discrete(labels = c('1 year', '2 years', '3 years', '4 years', '5 years', 
                              '6 years', '7 years', '8 years', '9 years', '10+ years')) +
  labs(x = "Years in Current Job", y = "Credit Score", title = "Credit Score by Years in Current Job") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  
        plot.title = element_text(size = 20, hjust = 0.5),
        axis.title.x = element_text(size = 18),
        axis.title.y = element_text(size = 18),
        legend.title = element_text(size = 16),               
        legend.text = element_text(size = 14)) 

p1 / p2

df2$Years.in.current.job <- as.numeric(df2$Years.in.current.job)

10+ years have the highest count of fully Paid loans compared to all other groups which indicates that customers with longer job tenures tend to have better repayment records. Credit scores are stable across all classes.

Outliers Detection

threshold_percentage <- 50

for (col in colnames(num)) {
  most_frequent_value <- names(sort(table(num[[col]]), decreasing = TRUE))[1]
  if (most_frequent_value == 0) {
    frequency <- max(table(num[[col]]))
    percentage <- (frequency / length(num[[col]])) * 100
    if (percentage > threshold_percentage) {
      cat("Column:", col, "\n")
      cat("Unique values:", paste(unique(num[[col]]), collapse = ", "), "\n")
      cat("Most frequent value:", most_frequent_value, "\n")
      cat("Frequency:", frequency, "\n")
      cat("Percentage:", sprintf("%.2f", percentage), "%\n")
      cat("\n")
    }
  }
}
## Column: Number.of.Credit.Problems 
## Unique values: 1, 0, 5, 2, 4, 3, 7, 6, 11, 15, 8, 10, 9, 12 
## Most frequent value: 0 
## Frequency: 74430 
## Percentage: 86.76 %
## 
## Column: Bankruptcies 
## Unique values: 1, 0, 2, 3, 5, 4, 7, 6 
## Most frequent value: 0 
## Frequency: 76864 
## Percentage: 89.59 %
## 
## Column: Tax.Liens 
## Unique values: 0, 1, 4, 2, 3, 6, 5, 11, 15, 7, 10, 9 
## Most frequent value: 0 
## Frequency: 84231 
## Percentage: 98.18 %

Since most of the data points are greater than 0 but still consist of relatively small integers, it’s possible that the outlier removal process excludes all values except 0. Thus, exclude these features from the outlier detection.

num_col <- colnames(num)
remove <- c('Number.of.Credit.Problems', 'Bankruptcies', 'Tax.Liens')
num_col <- setdiff(num_col, remove)
num <- num[, num_col]
head(num)
##   Current.Loan.Amount Credit.Score Annual.Income Years.in.current.job
## 1              445412          709       1167493                    8
## 2              262328          725       1198701                   10
## 3            99999999          741       2231892                    8
## 4              347666          721        806949                    3
## 5              176220          725       1198701                    5
## 6              206602          729        896857                   10
##   Monthly.Debt Years.of.Credit.History Number.of.Open.Accounts
## 1      5214.74                    17.2                       6
## 2     33295.98                    21.1                      35
## 3     29200.53                    14.9                      18
## 4      8741.90                    12.0                       9
## 5     20639.70                     6.1                      15
## 6     16367.74                    17.3                       6
##   Current.Credit.Balance Maximum.Open.Credit
## 1                 228190              416746
## 2                 229976              850784
## 3                 297996              750090
## 4                 256329              386958
## 5                 253460              427174
## 6                 215308              272448
num_long <- num %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")

# Create the boxplot
ggplot(num_long, aes(x = Feature, y = Value, fill = Feature)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +  # Rotate x-axis labels to 90 degrees
  labs(title = "Outlier", x = "Feature", y = "Value") +
  theme_minimal() +
  theme(plot.title = element_text(size = 20, hjust = 0.5),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        axis.text = element_text(size = 12),
        legend.position = "none")

Encoding

encode <- function(data, columns) {
  for (col in columns) {
    # Print unique values before encoding
    cat("Before encoding:", "\n")
    print(unique(data[[col]]))
    
    # Encoding the categorical column
    data[[col]] <- as.integer(factor(data[[col]]))
    
    # Print the encoded unique values
    cat("After encoding:", "\n")
    print(unique(data[[col]]))
    
    cat("\n")
  }
  
  return(data)
}
cat_col <- c('Loan.Status', 'Term', 'Home.Ownership', 'Purpose')
df2 <- encode(df2, cat_col)
## Before encoding: 
## [1] "Fully Paid"  "Charged Off"
## After encoding: 
## [1] 2 1
## 
## Before encoding: 
## [1] "Short Term" "Long Term" 
## After encoding: 
## [1] 2 1
## 
## Before encoding: 
## [1] "Home Mortgage" "Own Home"      "Rent"         
## After encoding: 
## [1] 1 2 3
## 
## Before encoding: 
## [1] "Home and Property"             "Debt Consolidation"           
## [3] "Business and Entrepreneurship" "Transportation"               
## [5] "Other"                         "Personal Expenses"            
## After encoding: 
## [1] 3 2 1 6 4 5

Variance Inflation Factor (VIF)

VIF is used to detect multicollinearity among independent variables.

Classification Problem

# VIF for classification problem
lm_classification <- lm(as.formula(paste("Loan.Status ~", paste(setdiff(colnames(df2), "Loan.Status"), collapse = "+"))), data = df2)
vif_classification_results <- vif(lm_classification)
vif_classification <- data.frame(Feature = names(vif_classification_results),
                                 VIF = vif_classification_results) %>%
          arrange(desc(VIF))
ggplot(vif_classification, aes(x = reorder(Feature, VIF), y = VIF)) +
  geom_bar(stat = "identity", fill = "#66C2A5") +
  geom_text(aes(label = round(VIF, 3)), hjust = -0.3, size = 15) +
  coord_flip() +
  labs(
    title = "Variance Inflation Factor (VIF) for Classification Problem Features",
    x = "Feature",
    y = "VIF"
  ) +
  theme_minimal(base_size = 15) +
  theme(plot.title = element_text(hjust = 0.5, size =55),
        axis.title.x = element_text(size = 45),
        axis.title.y = element_text(size = 45),
        axis.text = element_text(size = 40))

The VIF of all independent features of the classification problem are acceptable since VIF are less than 10.

Regression Problem

# VIF for classification problem
lm_regression <- lm(as.formula(paste("Credit.Score ~", paste(setdiff(colnames(df2), "Credit.Score"), collapse = "+"))), data = df2)
vif_regression_results <- vif(lm_regression)
vif_regression <- data.frame(Feature = names(vif_regression_results), 
                             VIF = vif_regression_results) %>%
          arrange(desc(VIF))
ggplot(vif_regression, aes(x = reorder(Feature, VIF), y = VIF)) +
  geom_bar(stat = "identity", fill = "#FC8D62") +
  geom_text(aes(label = round(VIF, 3)), hjust = -0.3, size = 15) +
  coord_flip() +
  labs(
    title = "Variance Inflation Factor (VIF) for Regression Problem Features",
    x = "Feature",
    y = "VIF"
  ) +
  theme_minimal(base_size = 15) +
  theme(plot.title = element_text(hjust = 0.5, size =55),
        axis.title.x = element_text(size = 45),
        axis.title.y = element_text(size = 45),
        axis.text = element_text(size = 40))

The VIF of all independent features of the regression problem are acceptable since VIF are less than 10.

Modelling

Classification Problem

Data Splitting

set.seed(123)
classet <- createDataPartition(df2$Loan.Status, p=0.8, list=F)

clastrain <- df2[classet, ]
clastest <- df2[-classet, ]

prop.table(table(df2$Loan.Status))
## 
##         1         2 
## 0.2484526 0.7515474
prop.table(table(clastrain$Loan.Status))
## 
##         1         2 
## 0.2492678 0.7507322
prop.table(table(clastest$Loan.Status))
## 
##         1         2 
## 0.2451917 0.7548083

The class distribution of Loan.Status remains the same across all set.

Resampling using SMOTE

classmote <- SMOTE(X = clastrain[, -1], 
                target = clastrain$Loan.Status, 
                K = 5, 
                dup_size = 2)
clastrain1 <- data.frame(classmote$data)
names(clastrain1)[ncol(clastrain1)] <- "Loan.Status"
clastrain1$Loan.Status <- as.factor(clastrain1$Loan.Status)

clastrain1_x <- as.matrix(clastrain1[, -which(names(clastrain1) == "Loan.Status")])
clastrain1_y <- as.numeric(clastrain1$Loan.Status) - 1

clastest_x <- as.matrix(clastest[, -which(names(clastest) == "Loan.Status")])
clastest_y <- as.numeric(clastest$Loan.Status) - 1
prop.table(table(clastrain$Loan.Status))
## 
##         1         2 
## 0.2492678 0.7507322
prop.table(table(clastrain1$Loan.Status))
## 
##         1         2 
## 0.4990228 0.5009772

The minority class is duplicated 2 times to get an equal distribution with the majority class.

Model Training

rf_classifier <- randomForest(Loan.Status ~ ., data = clastrain1,
                              ntree = 100, 
                              importance = TRUE)
xgb_classifier <- xgboost(
  data = clastrain1_x,          
  label = clastrain1_y,
  nrounds = 100,
  objective = 'binary:logistic'
)
## [1]  train-logloss:0.596056 
## [2]  train-logloss:0.542586 
## [3]  train-logloss:0.509624 
## [4]  train-logloss:0.486869 
## [5]  train-logloss:0.468468 
## [6]  train-logloss:0.450323 
## [7]  train-logloss:0.440073 
## [8]  train-logloss:0.427220 
## [9]  train-logloss:0.419883 
## [10] train-logloss:0.412799 
## [11] train-logloss:0.409033 
## [12] train-logloss:0.401297 
## [13] train-logloss:0.398308 
## [14] train-logloss:0.393384 
## [15] train-logloss:0.391144 
## [16] train-logloss:0.387098 
## [17] train-logloss:0.383878 
## [18] train-logloss:0.380269 
## [19] train-logloss:0.378728 
## [20] train-logloss:0.377309 
## [21] train-logloss:0.375065 
## [22] train-logloss:0.369018 
## [23] train-logloss:0.361935 
## [24] train-logloss:0.359263 
## [25] train-logloss:0.358279 
## [26] train-logloss:0.355269 
## [27] train-logloss:0.353848 
## [28] train-logloss:0.352044 
## [29] train-logloss:0.351285 
## [30] train-logloss:0.350476 
## [31] train-logloss:0.349817 
## [32] train-logloss:0.346267 
## [33] train-logloss:0.345514 
## [34] train-logloss:0.344774 
## [35] train-logloss:0.342828 
## [36] train-logloss:0.342304 
## [37] train-logloss:0.341778 
## [38] train-logloss:0.338915 
## [39] train-logloss:0.338116 
## [40] train-logloss:0.337389 
## [41] train-logloss:0.337218 
## [42] train-logloss:0.335160 
## [43] train-logloss:0.334314 
## [44] train-logloss:0.333590 
## [45] train-logloss:0.333170 
## [46] train-logloss:0.332305 
## [47] train-logloss:0.331696 
## [48] train-logloss:0.331025 
## [49] train-logloss:0.330768 
## [50] train-logloss:0.330379 
## [51] train-logloss:0.329106 
## [52] train-logloss:0.328854 
## [53] train-logloss:0.328327 
## [54] train-logloss:0.326051 
## [55] train-logloss:0.325576 
## [56] train-logloss:0.324394 
## [57] train-logloss:0.323411 
## [58] train-logloss:0.321972 
## [59] train-logloss:0.321335 
## [60] train-logloss:0.320433 
## [61] train-logloss:0.320057 
## [62] train-logloss:0.319495 
## [63] train-logloss:0.319166 
## [64] train-logloss:0.317997 
## [65] train-logloss:0.317380 
## [66] train-logloss:0.316968 
## [67] train-logloss:0.316569 
## [68] train-logloss:0.315874 
## [69] train-logloss:0.315299 
## [70] train-logloss:0.314597 
## [71] train-logloss:0.313909 
## [72] train-logloss:0.313707 
## [73] train-logloss:0.313235 
## [74] train-logloss:0.312517 
## [75] train-logloss:0.311922 
## [76] train-logloss:0.311067 
## [77] train-logloss:0.310525 
## [78] train-logloss:0.310114 
## [79] train-logloss:0.309645 
## [80] train-logloss:0.309160 
## [81] train-logloss:0.308734 
## [82] train-logloss:0.308627 
## [83] train-logloss:0.308426 
## [84] train-logloss:0.307926 
## [85] train-logloss:0.307294 
## [86] train-logloss:0.307170 
## [87] train-logloss:0.307060 
## [88] train-logloss:0.306673 
## [89] train-logloss:0.306314 
## [90] train-logloss:0.306202 
## [91] train-logloss:0.305896 
## [92] train-logloss:0.305518 
## [93] train-logloss:0.304962 
## [94] train-logloss:0.304567 
## [95] train-logloss:0.304233 
## [96] train-logloss:0.303952 
## [97] train-logloss:0.303403 
## [98] train-logloss:0.302593 
## [99] train-logloss:0.302280 
## [100]    train-logloss:0.301792

Test the Model

rf_clas_pred_prob <- predict(rf_classifier, newdata = clastest, type = "prob")
rf_clas_pred <- ifelse(rf_clas_pred_prob[, 2] > 0.5, 1, 0)
xgb_clas_pred_prob <- predict(xgb_classifier, clastest_x)
xgb_clas_pred <- ifelse(xgb_clas_pred_prob > 0.5, 1, 0)

Regression Problem

Data Splitting

set.seed(123)
regset <- createDataPartition(df2$Credit.Score, p=0.8, list=F)

regtrain <- df2[regset, ]
regtest <- df2[-regset, ]

regtrain_x <- as.matrix(regtrain[, -which(names(regtrain) == "Credit.Score")])
regtrain_y <- regtrain$Credit.Score

regtest_x <- as.matrix(regtest[, -which(names(regtest) == "Credit.Score")])
regtest_y <- regtest$Credit.Score

Model Training

rf_reg <- randomForest(Credit.Score ~ ., data=regtrain, ntree = 50)
xgb_reg <- xgboost(
  data = regtrain_x,          
  label = regtrain_y,
  nrounds = 100
)
## [1]  train-rmse:504.152902 
## [2]  train-rmse:353.258614 
## [3]  train-rmse:247.772815 
## [4]  train-rmse:174.113107 
## [5]  train-rmse:122.808234 
## [6]  train-rmse:87.233804 
## [7]  train-rmse:62.787166 
## [8]  train-rmse:46.273409 
## [9]  train-rmse:35.424976 
## [10] train-rmse:28.599793 
## [11] train-rmse:24.494535 
## [12] train-rmse:22.172118 
## [13] train-rmse:20.899739 
## [14] train-rmse:20.196642 
## [15] train-rmse:19.802898 
## [16] train-rmse:19.535091 
## [17] train-rmse:19.395052 
## [18] train-rmse:19.305943 
## [19] train-rmse:19.223774 
## [20] train-rmse:19.146398 
## [21] train-rmse:19.062564 
## [22] train-rmse:19.009610 
## [23] train-rmse:18.957081 
## [24] train-rmse:18.904423 
## [25] train-rmse:18.859871 
## [26] train-rmse:18.806744 
## [27] train-rmse:18.785699 
## [28] train-rmse:18.760868 
## [29] train-rmse:18.716777 
## [30] train-rmse:18.696336 
## [31] train-rmse:18.660398 
## [32] train-rmse:18.616775 
## [33] train-rmse:18.597696 
## [34] train-rmse:18.551940 
## [35] train-rmse:18.538307 
## [36] train-rmse:18.487860 
## [37] train-rmse:18.451868 
## [38] train-rmse:18.444705 
## [39] train-rmse:18.405782 
## [40] train-rmse:18.382120 
## [41] train-rmse:18.365880 
## [42] train-rmse:18.344140 
## [43] train-rmse:18.322081 
## [44] train-rmse:18.289515 
## [45] train-rmse:18.266673 
## [46] train-rmse:18.259950 
## [47] train-rmse:18.234204 
## [48] train-rmse:18.209596 
## [49] train-rmse:18.173723 
## [50] train-rmse:18.144786 
## [51] train-rmse:18.126122 
## [52] train-rmse:18.093391 
## [53] train-rmse:18.066887 
## [54] train-rmse:18.045779 
## [55] train-rmse:18.020931 
## [56] train-rmse:17.995193 
## [57] train-rmse:17.975349 
## [58] train-rmse:17.942384 
## [59] train-rmse:17.926082 
## [60] train-rmse:17.906469 
## [61] train-rmse:17.884972 
## [62] train-rmse:17.864283 
## [63] train-rmse:17.845357 
## [64] train-rmse:17.830017 
## [65] train-rmse:17.822264 
## [66] train-rmse:17.806078 
## [67] train-rmse:17.803143 
## [68] train-rmse:17.777942 
## [69] train-rmse:17.745179 
## [70] train-rmse:17.740964 
## [71] train-rmse:17.708373 
## [72] train-rmse:17.695856 
## [73] train-rmse:17.671558 
## [74] train-rmse:17.633395 
## [75] train-rmse:17.615272 
## [76] train-rmse:17.598380 
## [77] train-rmse:17.588348 
## [78] train-rmse:17.566782 
## [79] train-rmse:17.542397 
## [80] train-rmse:17.528464 
## [81] train-rmse:17.502755 
## [82] train-rmse:17.491953 
## [83] train-rmse:17.478933 
## [84] train-rmse:17.455682 
## [85] train-rmse:17.442524 
## [86] train-rmse:17.435676 
## [87] train-rmse:17.424583 
## [88] train-rmse:17.407363 
## [89] train-rmse:17.381464 
## [90] train-rmse:17.365696 
## [91] train-rmse:17.363838 
## [92] train-rmse:17.360838 
## [93] train-rmse:17.345311 
## [94] train-rmse:17.321054 
## [95] train-rmse:17.302149 
## [96] train-rmse:17.270130 
## [97] train-rmse:17.250148 
## [98] train-rmse:17.225718 
## [99] train-rmse:17.205127 
## [100]    train-rmse:17.182926

Test the Model

rf_reg_pred <- predict(rf_reg, newdata = regtest_x)
xgb_reg_pred <- predict(xgb_reg, regtest_x)

Evaluation

Classification Problem

# Random Forest performance
rf_conf_matrix <- confusionMatrix(factor(rf_clas_pred), factor(clastest_y))
rf_accuracy <- rf_conf_matrix$overall["Accuracy"]
rf_precision <- rf_conf_matrix$byClass["Pos Pred Value"]
rf_recall <- rf_conf_matrix$byClass["Sensitivity"]
rf_f1_score <- rf_conf_matrix$byClass["F1"]
roc_rf <- roc(clastest_y, rf_clas_pred_prob[, 2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
rf_auc_roc <- auc(roc_rf)

# XGBoost performance
xgb_conf_matrix <- confusionMatrix(factor(xgb_clas_pred), factor(clastest_y))
xgb_accuracy <- xgb_conf_matrix$overall["Accuracy"]
xgb_precision <- xgb_conf_matrix$byClass["Pos Pred Value"]
xgb_recall <- xgb_conf_matrix$byClass["Sensitivity"]
xgb_f1_score <- xgb_conf_matrix$byClass["F1"]
xgb_roc <- roc(clastest_y, xgb_clas_pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
xgb_auc_roc <- auc(xgb_roc)
classification_results <- data.frame(
  Model = c("Random Forest", "XGBoost"),
  Accuracy = c(rf_accuracy, xgb_accuracy),
  Precision = c(rf_precision, xgb_precision),
  Recall = c(rf_recall, xgb_recall),
  F1_Score = c(rf_f1_score, xgb_f1_score),
  AUC_ROC = c(rf_auc_roc, xgb_auc_roc)
)

classification_results
##           Model  Accuracy Precision    Recall  F1_Score   AUC_ROC
## 1 Random Forest 0.7458911 0.4567063 0.1918232 0.2701707 0.7037483
## 2       XGBoost 0.7555076 0.5054054 0.1333492 0.2110213 0.6988339

The machine learning models were evaluated based on their performance metrics: Accuracy, Precision, Recall, F1-Score and AUC-ROC. XGBoost has a slightly better accuracy & precision compared to Random Forest. However, Random Forest performs slightly better in terms of the recall, F1-Score and AUC ROC

Regression Problem

evaluate_regression <- function(predictions, actuals) {
  # Using caret
  r_squared <- R2(pred = predictions, obs = actuals)
  rmse <- RMSE(pred = predictions, obs = actuals)
  
  # Compile results into a data frame
  results <- data.frame(R_squared = r_squared, RMSE = rmse)
  return(results)
}
rf_reg_results <- evaluate_regression(rf_reg_pred, regtest_y)
xgb_reg_results <- evaluate_regression(xgb_reg_pred, regtest_y)

regression_results <- data.frame(
  Model = c("Random Forest Regressor", "XGBoost Regressor"),
  R_Squared = c(rf_reg_results$R_squared, xgb_reg_results$R_squared),
  RMSE = c(rf_reg_results$RMSE, xgb_reg_results$RMSE)
)

regression_results
##                     Model R_Squared     RMSE
## 1 Random Forest Regressor 0.3828138 19.68817
## 2       XGBoost Regressor 0.3827512 19.68262

For the regression problem, the models were evaluated based on R-Squared and Root Mean Squared Error (RMSE). Both models have nearly identical performance for R-Squared where the difference is only 0.0000626. This is also a similar case for RMSE where they are almost identical with XGBoost has a slightly lower RMSE.

Conclusion

For loan eligibility prediction, both Random Forest and XGBoost were effective. However, the choice between them depends on the problem’s priorities: If accuracy and precision are crucial, XGBoost is slightly better. If recall and overall balance are more critical, Random Forest performs better.

For predicting credit scores, both models demonstrated similar performance, explaining approximately 38% of the variance in the data. The minimal difference in RMSE suggests either model can be used effectively.

This project successfully demonstrated the application of machine learning for both classification and regression tasks in the context of loan eligibility and credit score prediction. The models provide actionable insights that can aid financial institutions in assessing loan risk and creditworthiness. Future improvements could focus on hyperparameter tuning and feature engineering for better performance.