Group 9: Loan Eligibility Prediction and Default Risk
Group Members:
- Mohammad Iqbal Afif bin Mohamad Shahnaz (U2004720)
- Ahmad Marwan Bin Murshidi (17205900)
- Mawaddah Binti Musthafa (17147301)
- Mohamad Ziqry Bin Zulkifli (S2153309)
- Ahmad Salim Odeh Alsane (23083537)
- Determine the details about the dataset (e.g. title, year, purpose of dataset, dimension, content, structure, and the summary) by exploring the raw data.
- Data cleaning: Putting it all together as a new cleaned/processed dataset. Which section of the data do you need to tidy? You can use any cleaning packages in R.
- Data analysis: Identify the questions, what is the objective/goal of processing this dataset? What answers are you interested to find from this dataset? You must have at least two questions. It is best to have different types of problems, i.e. one regression, and one classification.
Introduction: In the past, banks used a manual and often unreliable process to review loan applications. This traditional method made it difficult to accurately predict the risk of default. This project leverages machine learning techniques in R to develop a Loan Prediction system. By automating and enhancing the assessment process, this system aims to provide faster and more accurate evaluations, ultimately helping banks minimize losses from loan defaults.
Objectives:
1. To predict potential customers’ ability to repay their loans based on information obtained from existing customers.
2. To develop a classification model to predict an applicant’s approval for a loan based on their financial and demographic information.
3. To create a regression model to estimate the probability of default for each applicant.
4. To improve the accuracy of loan eligibility predictions to enhance the identification of potential defaulters and minimize banks’ financial risks.
Questions:
1. Classification problem - Which applicants are likely to be eligible for a loan? (Target Variable: Loan.Status)
2. Regression problem - What is the expected default risk for each applicant? (Target Variable: Credit.Score)
Install and Import Libraries
# install.packages("psych")
library(psych)
# install.packages("tidyverse")
library(tidyverse)
# install.packages("scales")
library(scales)
# install.packages("stats")
library(stats)
# install.packages("patchwork")
library(patchwork)
# install.packages("corrplot")
library(corrplot)
# install.packages("car")
library(car)
# install.packages("caret")
library(caret)
# install.packages("smotefamily")
library(smotefamily)
# install.packages('randomForest')
library(randomForest)
# install.packages('xgboost')
library(xgboost)
# install.packages('pROC')
library(pROC)
Loading Data
df_raw <- read.csv("https://github.com/bilecops/Projects/blob/71b0f6e4f715355606e8e2ed3cbb729d16a105ab/credit_train.csv?raw=true")
Data Exploration
- Display the column names
colnames(df_raw)
## [1] "Loan.ID" "Customer.ID"
## [3] "Loan.Status" "Current.Loan.Amount"
## [5] "Term" "Credit.Score"
## [7] "Annual.Income" "Years.in.current.job"
## [9] "Home.Ownership" "Purpose"
## [11] "Monthly.Debt" "Years.of.Credit.History"
## [13] "Months.since.last.delinquent" "Number.of.Open.Accounts"
## [15] "Number.of.Credit.Problems" "Current.Credit.Balance"
## [17] "Maximum.Open.Credit" "Bankruptcies"
## [19] "Tax.Liens"
- Display the data frame
head(df_raw)
## Loan.ID Customer.ID
## 1 14dd8831-6af5-400b-83ec-68e61888a048 981165ec-3274-42f5-a3b4-d104041a9ca9
## 2 4771cc26-131a-45db-b5aa-537ea4ba5342 2de017a3-2e01-49cb-a581-08169e83be29
## 3 4eed4e6a-aa2f-4c91-8651-ce984ee8fb26 5efb2b2b-bf11-4dfd-a572-3761a2694725
## 4 77598f7b-32e7-4e3b-a6e5-06ba0d98fe8a e777faab-98ae-45af-9a86-7ce5b33b1011
## 5 d4062e70-befa-4995-8643-a0de73938182 81536ad9-5ccf-4eb8-befb-47a4d608658e
## 6 89d8cb0c-e5c2-4f54-b056-48a645c543dd 4ffe99d3-7f2a-44db-afc1-40943f1f9750
## Loan.Status Current.Loan.Amount Term Credit.Score Annual.Income
## 1 Fully Paid 445412 Short Term 709 1167493
## 2 Fully Paid 262328 Short Term NA NA
## 3 Fully Paid 99999999 Short Term 741 2231892
## 4 Fully Paid 347666 Long Term 721 806949
## 5 Fully Paid 176220 Short Term NA NA
## 6 Charged Off 206602 Short Term 7290 896857
## Years.in.current.job Home.Ownership Purpose Monthly.Debt
## 1 8 years Home Mortgage Home Improvements 5214.74
## 2 10+ years Home Mortgage Debt Consolidation 33295.98
## 3 8 years Own Home Debt Consolidation 29200.53
## 4 3 years Own Home Debt Consolidation 8741.90
## 5 5 years Rent Debt Consolidation 20639.70
## 6 10+ years Home Mortgage Debt Consolidation 16367.74
## Years.of.Credit.History Months.since.last.delinquent Number.of.Open.Accounts
## 1 17.2 NA 6
## 2 21.1 8 35
## 3 14.9 29 18
## 4 12.0 NA 9
## 5 6.1 NA 15
## 6 17.3 NA 6
## Number.of.Credit.Problems Current.Credit.Balance Maximum.Open.Credit
## 1 1 228190 416746
## 2 0 229976 850784
## 3 1 297996 750090
## 4 0 256329 386958
## 5 0 253460 427174
## 6 0 215308 272448
## Bankruptcies Tax.Liens
## 1 1 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
tail(df_raw)
## Loan.ID Customer.ID Loan.Status Current.Loan.Amount Term Credit.Score
## 100509 NA NA
## 100510 NA NA
## 100511 NA NA
## 100512 NA NA
## 100513 NA NA
## 100514 NA NA
## Annual.Income Years.in.current.job Home.Ownership Purpose Monthly.Debt
## 100509 NA NA
## 100510 NA NA
## 100511 NA NA
## 100512 NA NA
## 100513 NA NA
## 100514 NA NA
## Years.of.Credit.History Months.since.last.delinquent
## 100509 NA NA
## 100510 NA NA
## 100511 NA NA
## 100512 NA NA
## 100513 NA NA
## 100514 NA NA
## Number.of.Open.Accounts Number.of.Credit.Problems Current.Credit.Balance
## 100509 NA NA NA
## 100510 NA NA NA
## 100511 NA NA NA
## 100512 NA NA NA
## 100513 NA NA NA
## 100514 NA NA NA
## Maximum.Open.Credit Bankruptcies Tax.Liens
## 100509 NA NA NA
## 100510 NA NA NA
## 100511 NA NA NA
## 100512 NA NA NA
## 100513 NA NA NA
## 100514 NA NA NA
- Convert missing values recorded as empty or whitespace-only strings into NA values
# Replace empty or whitespace-only strings with NA, column by column
# (trimws() is applied per column; calling it on the whole data frame
# coerces the data frame to a character vector and does not work as intended)
df_raw[] <- lapply(df_raw, function(x) {
  if (is.character(x)) x[trimws(x) == ""] <- NA
  x
})
tail(df_raw)
## Loan.ID Customer.ID Loan.Status Current.Loan.Amount Term Credit.Score
## 100509 <NA> <NA> <NA> NA <NA> NA
## 100510 <NA> <NA> <NA> NA <NA> NA
## 100511 <NA> <NA> <NA> NA <NA> NA
## 100512 <NA> <NA> <NA> NA <NA> NA
## 100513 <NA> <NA> <NA> NA <NA> NA
## 100514 <NA> <NA> <NA> NA <NA> NA
## Annual.Income Years.in.current.job Home.Ownership Purpose Monthly.Debt
## 100509 NA <NA> <NA> <NA> NA
## 100510 NA <NA> <NA> <NA> NA
## 100511 NA <NA> <NA> <NA> NA
## 100512 NA <NA> <NA> <NA> NA
## 100513 NA <NA> <NA> <NA> NA
## 100514 NA <NA> <NA> <NA> NA
## Years.of.Credit.History Months.since.last.delinquent
## 100509 NA NA
## 100510 NA NA
## 100511 NA NA
## 100512 NA NA
## 100513 NA NA
## 100514 NA NA
## Number.of.Open.Accounts Number.of.Credit.Problems Current.Credit.Balance
## 100509 NA NA NA
## 100510 NA NA NA
## 100511 NA NA NA
## 100512 NA NA NA
## 100513 NA NA NA
## 100514 NA NA NA
## Maximum.Open.Credit Bankruptcies Tax.Liens
## 100509 NA NA NA
## 100510 NA NA NA
## 100511 NA NA NA
## 100512 NA NA NA
## 100513 NA NA NA
## 100514 NA NA NA
- Check the number of rows with all NA values across columns
sum(apply(df_raw, 1, function(row) all(is.na(row))))
## [1] 514
- Drop rows that are entirely NA
df1 <- df_raw[rowSums(is.na(df_raw)) != ncol(df_raw), ]
- Check the current number of rows and columns of the data frame
cat("Number of rows: ", nrow(df1), "\n", "Number of columns: ", ncol(df1), sep="")
## Number of rows: 100000
## Number of columns: 19
- Check the data type of each column
lapply(df1, class)
## $Loan.ID
## [1] "character"
##
## $Customer.ID
## [1] "character"
##
## $Loan.Status
## [1] "character"
##
## $Current.Loan.Amount
## [1] "integer"
##
## $Term
## [1] "character"
##
## $Credit.Score
## [1] "integer"
##
## $Annual.Income
## [1] "integer"
##
## $Years.in.current.job
## [1] "character"
##
## $Home.Ownership
## [1] "character"
##
## $Purpose
## [1] "character"
##
## $Monthly.Debt
## [1] "numeric"
##
## $Years.of.Credit.History
## [1] "numeric"
##
## $Months.since.last.delinquent
## [1] "integer"
##
## $Number.of.Open.Accounts
## [1] "integer"
##
## $Number.of.Credit.Problems
## [1] "integer"
##
## $Current.Credit.Balance
## [1] "integer"
##
## $Maximum.Open.Credit
## [1] "integer"
##
## $Bankruptcies
## [1] "integer"
##
## $Tax.Liens
## [1] "integer"
- Describe the data
describe(df1)
## vars n mean sd median
## Loan.ID* 1 100000 41014.91 23696.13 41036.5
## Customer.ID* 2 100000 41016.45 23686.34 41020.5
## Loan.Status* 3 100000 1.77 0.42 2.0
## Current.Loan.Amount 4 100000 11760447.39 31783942.55 312246.0
## Term* 5 100000 1.72 0.45 2.0
## Credit.Score 6 80846 1076.46 1475.40 724.0
## Annual.Income 7 80846 1378276.56 1081360.20 1174162.0
## Years.in.current.job* 8 100000 5.12 3.10 4.0
## Home.Ownership* 9 100000 2.93 0.95 3.0
## Purpose* 10 100000 4.80 2.21 4.0
## Monthly.Debt 11 100000 18472.41 12174.99 16220.3
## Years.of.Credit.History 12 100000 18.20 7.02 16.9
## Months.since.last.delinquent 13 46859 34.90 22.00 32.0
## Number.of.Open.Accounts 14 100000 11.13 5.01 10.0
## Number.of.Credit.Problems 15 100000 0.17 0.48 0.0
## Current.Credit.Balance 16 100000 294637.38 376170.93 209817.0
## Maximum.Open.Credit 17 99998 760798.38 8384503.47 467874.0
## Bankruptcies 18 99796 0.12 0.35 0.0
## Tax.Liens 19 99990 0.03 0.26 0.0
## trimmed mad min max
## Loan.ID* 41019.01 30426.66 1.0 81999.0
## Customer.ID* 41016.79 30410.35 1.0 81999.0
## Loan.Status* 1.84 0.00 1.0 2.0
## Current.Loan.Amount 2191409.03 218926.65 10802.0 99999999.0
## Term* 1.78 0.00 1.0 2.0
## Credit.Score 722.47 25.20 585.0 7510.0
## Annual.Income 1249929.04 564430.27 76627.0 165557393.0
## Years.in.current.job* 4.85 2.97 1.0 12.0
## Home.Ownership* 2.92 1.48 1.0 4.0
## Purpose* 4.25 0.00 1.0 16.0
## Monthly.Debt 17090.61 9969.71 0.0 435843.3
## Years.of.Credit.History 17.53 5.93 3.6 70.5
## Months.since.last.delinquent 33.63 25.20 0.0 176.0
## Number.of.Open.Accounts 10.66 4.45 0.0 76.0
## Number.of.Credit.Problems 0.05 0.00 0.0 15.0
## Current.Credit.Balance 239638.11 170509.38 0.0 32878968.0
## Maximum.Open.Credit 526283.54 342333.82 0.0 1539737892.0
## Bankruptcies 0.01 0.00 0.0 7.0
## Tax.Liens 0.00 0.00 0.0 15.0
## range skew kurtosis se
## Loan.ID* 81998.0 0.00 -1.20 74.93
## Customer.ID* 81998.0 0.00 -1.20 74.90
## Loan.Status* 1.0 -1.31 -0.29 0.00
## Current.Loan.Amount 99989197.0 2.42 3.84 100509.65
## Term* 1.0 -0.99 -1.02 0.00
## Credit.Score 6925.0 3.86 12.97 5.19
## Annual.Income 165480766.0 46.89 6623.59 3803.13
## Years.in.current.job* 11.0 0.72 -0.59 0.01
## Home.Ownership* 3.0 0.12 -1.86 0.00
## Purpose* 15.0 2.21 4.55 0.01
## Monthly.Debt 435843.3 2.21 22.19 38.50
## Years.of.Credit.History 66.9 1.07 1.74 0.02
## Months.since.last.delinquent 176.0 0.43 -0.75 0.10
## Number.of.Open.Accounts 76.0 1.18 3.04 0.02
## Number.of.Credit.Problems 15.0 4.82 48.01 0.00
## Current.Credit.Balance 32878968.0 14.15 697.45 1189.56
## Maximum.Open.Credit 1539737892.0 132.63 20393.41 26514.39
## Bankruptcies 7.0 3.51 18.53 0.00
## Tax.Liens 15.0 15.50 402.04 0.00
- Check the number of unique values for each column
lapply(df1, function(x) length(unique(x)))
## $Loan.ID
## [1] 81999
##
## $Customer.ID
## [1] 81999
##
## $Loan.Status
## [1] 2
##
## $Current.Loan.Amount
## [1] 22004
##
## $Term
## [1] 2
##
## $Credit.Score
## [1] 325
##
## $Annual.Income
## [1] 36175
##
## $Years.in.current.job
## [1] 12
##
## $Home.Ownership
## [1] 4
##
## $Purpose
## [1] 16
##
## $Monthly.Debt
## [1] 65765
##
## $Years.of.Credit.History
## [1] 506
##
## $Months.since.last.delinquent
## [1] 117
##
## $Number.of.Open.Accounts
## [1] 51
##
## $Number.of.Credit.Problems
## [1] 14
##
## $Current.Credit.Balance
## [1] 32730
##
## $Maximum.Open.Credit
## [1] 44597
##
## $Bankruptcies
## [1] 9
##
## $Tax.Liens
## [1] 13
- Inspect the columns with 16 or fewer unique values
columns <- names(Filter(function(x) length(unique(x)) <= 16, df1))
for (col in columns) {
cat(col, ":\n")
cat("Value counts:\n")
print(table(df1[[col]]))
cat("\n", paste(rep("__", 20), collapse = ""), "\n\n")
}
## Loan.Status :
## Value counts:
##
## Charged Off Fully Paid
## 22639 77361
##
## ________________________________________
##
## Term :
## Value counts:
##
## Long Term Short Term
## 27792 72208
##
## ________________________________________
##
## Years.in.current.job :
## Value counts:
##
## < 1 year 1 year 10+ years 2 years 3 years 4 years 5 years 6 years
## 8164 6460 31121 9134 8169 6143 6787 5686
## 7 years 8 years 9 years n/a
## 5577 4582 3955 4222
##
## ________________________________________
##
## Home.Ownership :
## Value counts:
##
## HaveMortgage Home Mortgage Own Home Rent
## 214 48410 9182 42194
##
## ________________________________________
##
## Purpose :
## Value counts:
##
## Business Loan Buy a Car Buy House
## 1569 1265 678
## Debt Consolidation Educational Expenses Home Improvements
## 78552 99 5839
## major_purchase Medical Bills moving
## 352 1127 150
## other Other renewable_energy
## 6037 3250 10
## small_business Take a Trip vacation
## 283 573 101
## wedding
## 115
##
## ________________________________________
##
## Number.of.Credit.Problems :
## Value counts:
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 86035 12077 1299 378 125 49 17 8 4 2 2 2 1
## 15
## 1
##
## ________________________________________
##
## Bankruptcies :
## Value counts:
##
## 0 1 2 3 4 5 6 7
## 88774 10475 417 93 27 7 2 1
##
## ________________________________________
##
## Tax.Liens :
## Value counts:
##
## 0 1 2 3 4 5 6 7 9 10 11 15
## 98062 1343 374 111 58 16 12 7 3 1 2 1
##
## ________________________________________
- Convert the string “n/a” into NA values in the “Years.in.current.job” column
df1$Years.in.current.job[df1$Years.in.current.job == "n/a"] <- NA
- Check the unique values of the “Credit.Score” column
output <- paste(
"Credit.Score:\n",
"Value counts:\n",
paste(capture.output(print(table(df1[["Credit.Score"]]))), collapse = "\n"),
sep = ""
)
cat(output)
## Credit.Score:
## Value counts:
##
## 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600
## 12 7 11 20 6 8 9 4 7 10 19 15 23 9 15 25
## 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616
## 29 16 19 13 23 20 13 19 26 28 29 32 23 33 30 34
## 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632
## 31 32 42 26 41 50 28 52 36 39 43 44 30 35 33 67
## 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648
## 58 45 72 62 47 71 42 88 77 81 87 73 99 75 71 76
## 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664
## 126 98 101 122 118 129 147 171 123 118 148 144 168 137 182 173
## 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680
## 161 167 176 227 230 227 208 219 212 262 262 284 234 298 285 346
## 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696
## 277 294 365 327 369 369 367 338 384 320 386 504 509 506 506 483
## 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712
## 552 595 660 612 620 644 717 716 727 791 814 902 865 857 791 1006
## 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728
## 1017 1046 1070 1061 1218 1261 1118 1216 1465 1387 1421 1522 1548 853 806 931
## 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744
## 925 984 1010 1084 1073 1147 1134 1156 1405 1495 1624 1746 1732 1723 1555 1485
## 745 746 747 748 749 750 751 5850 5860 5890 5900 5920 5930 5940 5950 5960
## 1612 1742 1825 1598 827 1234 723 2 1 1 1 1 2 3 3 1
## 6010 6050 6060 6070 6080 6090 6100 6110 6120 6130 6140 6150 6160 6170 6180 6190
## 1 1 2 1 2 4 2 4 2 3 2 3 6 4 5 2
## 6200 6210 6220 6230 6240 6250 6260 6270 6280 6290 6300 6310 6320 6330 6340 6350
## 2 3 7 6 8 4 7 4 3 5 6 4 7 6 4 6
## 6360 6370 6380 6390 6400 6410 6420 6430 6440 6450 6460 6470 6480 6490 6500 6510
## 4 2 9 7 9 12 6 4 7 13 3 6 6 7 13 9
## 6520 6530 6540 6550 6560 6570 6580 6590 6600 6610 6620 6630 6640 6650 6660 6670
## 15 12 7 8 12 11 13 12 11 13 13 21 9 25 19 17
## 6680 6690 6700 6710 6720 6730 6740 6750 6760 6770 6780 6790 6800 6810 6820 6830
## 17 16 17 14 17 16 24 22 12 16 20 18 25 30 21 24
## 6840 6850 6860 6870 6880 6890 6900 6910 6920 6930 6940 6950 6960 6970 6980 6990
## 35 25 28 27 24 29 25 22 29 36 38 39 38 32 36 46
## 7000 7010 7020 7030 7040 7050 7060 7070 7080 7090 7100 7110 7120 7130 7140 7150
## 35 46 37 39 49 44 45 52 50 58 54 45 51 53 62 45
## 7160 7170 7180 7190 7200 7210 7220 7230 7240 7250 7260 7270 7280 7290 7300 7310
## 68 74 57 68 63 57 59 63 84 75 64 74 77 60 73 73
## 7320 7330 7340 7350 7360 7370 7380 7390 7400 7410 7420 7430 7440 7450 7460 7470
## 83 88 87 76 85 106 93 108 97 87 84 70 58 55 76 51
## 7480 7490 7500 7510
## 43 23 24 9
- Pie chart of target variable “Loan Status”
counts <- table(df1$`Loan.Status`)
percentages <- prop.table(counts) * 100
status <- c("Charged Off", "Fully Paid")
labels <- paste(status, "\n", "Count:", counts, "\n", "(", round(percentages, 1), "%)")
par(pin = c(3, 3))
pie(percentages,
labels = labels,
col = c("red", "green"),
main = 'Percentage distribution of "Loan Status" column',
cex = 1.5)
Data Exploration Report
- Contains 100000 rows and 19 columns (after deleting the 514 rows that were entirely NA)
- 7 categorical columns and 12 numerical columns
- In the “Years.in.current.job” column, all string values of “n/a” were changed into NA values
- “Credit.Score” has anomalies: the maximum is 7510, whereas the range should be around 300 - 850
- “Credit.Score” values above 850 likely contain a data entry error, where one extra trailing zero was appended
- The extra trailing zero should be removed from “Credit.Score” values above 850
- The distribution of the target variable, “Loan.Status”, is imbalanced
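As a quick sanity check of the column-type counts summarised above, the snippet below (a minimal sketch, not part of the original pipeline) tallies numeric versus non-numeric columns in df1.
# Count numerical vs categorical (character) columns
table(sapply(df1, function(x) if (is.numeric(x)) "numerical" else "categorical"))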
Data Cleaning
Duplicates
- Create copy of data frame
df2 <- data.frame(df1)
- Check the number of duplicates and percentage of duplicates
cat("Number of duplicates: ", sum(duplicated(df2)), "\n", "Percentage of duplicates: ", sum(duplicated(df2)) / nrow(df2) * 100, "%", sep="")
## Number of duplicates: 10215
## Percentage of duplicates: 10.215%
- Drop the duplicates
df2 <- df2[!duplicated(df2),]
cat("Number of duplicates: ", sum(duplicated(df2)), "\n", "Percentage of duplicates: ", sum(duplicated(df2)) / nrow(df2) * 100, "%", sep="")
## Number of duplicates: 0
## Percentage of duplicates: 0%
Missing Values
- Checking the missing values for each column
for (col in colnames(df2)) {
if (any(is.na(df2[[col]]))) { # Access column by name
cat("Column: ", "\"", col, "\"", "\n",
"Missing Values Count: ", sum(is.na(df2[[col]])), "\n",
"Percentage: ", round(sum(is.na(df2[[col]])) / nrow(df2) * 100, 2), "%\n",
"_____________________________\n\n", sep="")
}
}
## Column: "Credit.Score"
## Missing Values Count: 19154
## Percentage: 21.33%
## _____________________________
##
## Column: "Annual.Income"
## Missing Values Count: 19154
## Percentage: 21.33%
## _____________________________
##
## Column: "Years.in.current.job"
## Missing Values Count: 3802
## Percentage: 4.23%
## _____________________________
##
## Column: "Months.since.last.delinquent"
## Missing Values Count: 48337
## Percentage: 53.84%
## _____________________________
##
## Column: "Maximum.Open.Credit"
## Missing Values Count: 2
## Percentage: 0%
## _____________________________
##
## Column: "Bankruptcies"
## Missing Values Count: 190
## Percentage: 0.21%
## _____________________________
##
## Column: "Tax.Liens"
## Missing Values Count: 9
## Percentage: 0.01%
## _____________________________
Comments:
- The “Months.since.last.delinquent” column has more than 50% missing values, so the column will be dropped.
- Rows with missing values in “Years.in.current.job”, “Bankruptcies”, “Tax.Liens” or “Maximum.Open.Credit” will be dropped, since their percentages of missing values are relatively low (less than 5%).
- Drop the “Months.since.last.delinquent” column
df2$Months.since.last.delinquent <- NULL
- Drop rows with missing values in “Years.in.current.job”, “Bankruptcies”, “Tax.Liens”, “Maximum.Open.Credit”
df2 <- df2[complete.cases(df2[, c("Years.in.current.job", "Bankruptcies", "Tax.Liens", "Maximum.Open.Credit")]), ]
cat("Number of rows: ", nrow(df2), "\n", "Number of columns: ", ncol(df2), sep="")
## Number of rows: 85791
## Number of columns: 18
- Checking the missing values for each column
for (col in colnames(df2)) {
if (any(is.na(df2[[col]]))) { # Access column by name
cat("Column: ", "\"", col, "\"", "\n",
"Missing Values Count: ", sum(is.na(df2[[col]])), "\n",
"Percentage: ", round(sum(is.na(df2[[col]])) / nrow(df2) * 100, 2), "%\n",
"_____________________________\n\n", sep="")
}
}
## Column: "Credit.Score"
## Missing Values Count: 18301
## Percentage: 21.33%
## _____________________________
##
## Column: "Annual.Income"
## Missing Values Count: 18301
## Percentage: 21.33%
## _____________________________
Handling Anomalies in “Credit.Score”
- Check the distribution of “Credit.Score”
hist(df2$Credit.Score,
main = "Distribution of Credit.Score",
xlab = "Credit.Score",
col = "skyblue",
border = "white")
The range of “Credit.Score” should be around 300 - 850.
Credit scores affected by a data entry error have one extra trailing zero (e.g., 729 recorded as 7290).
The extra trailing zero will be removed from “Credit.Score” values greater than 850.
- Remove the extra trailing zero from “Credit.Score” values greater than 850
df2$Credit.Score <- ifelse(df2$Credit.Score > 850,
df2$Credit.Score / 10,
df2$Credit.Score)
- Check the current distribution of “Credit.Score”
hist(df2$Credit.Score,
main = "Distribution of Credit.Score",
xlab = "Credit.Score",
col = "skyblue",
border = "white")
- Now all the values of “Credit.Score” fall within the expected range of 300 - 850
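As a quick check (a minimal sketch, assuming the correction above has been applied), the corrected score range can be verified directly:
# Verify that no score exceeds 850 after the fix; NAs are still present at this point
range(df2$Credit.Score, na.rm = TRUE)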
Imputation for missing values in “Credit.Score”
- Describe the “Credit.Score” statistics
describe(df2["Credit.Score"])
## vars n mean sd median trimmed mad min max range skew
## Credit.Score 1 67490 718.55 28.18 725 722.81 23.72 585 751 166 -1.45
## kurtosis se
## Credit.Score 2.31 0.11
- Impute missing values in “Credit.Score” using the median value
# Calculate the median of Credit.Score, excluding missing values
median_value <- median(df2$Credit.Score, na.rm = TRUE)
cat('Median for "Credit.Score":', median_value)
## Median for "Credit.Score": 725
# Replace missing values in Credit.Score with the median
df2$Credit.Score[is.na(df2$Credit.Score)] <- median_value
- Check the current distribution of “Credit.Score”
hist(df2$Credit.Score,
main = "Distribution of Credit.Score",
xlab = "Credit.Score",
col = "skyblue",
border = "white")
- Describe the current “Credit.Score” statistics
describe(df2["Credit.Score"])
## vars n mean sd median trimmed mad min max range skew
## Credit.Score 1 85791 719.93 25.14 725 723.84 17.79 585 751 166 -1.77
## kurtosis se
## Credit.Score 3.98 0.09
Imputation for missing values in “Annual.Income”
- Check the distribution of “Annual.Income”
hist(log(df2$Annual.Income),
main = "Distribution of log(Annual.Income)",
xlab = "log(Annual.Income)",
col = "skyblue",
border = "white")
- Describe the “Annual.Income” statistics
describe(df2["Annual.Income"])
## vars n mean sd median trimmed mad min
## Annual.Income 1 67490 1396654 1118572 1198701 1266145 570050.1 76627
## max range skew kurtosis se
## Annual.Income 165557393 165480766 50.36 6924.17 4305.71
- Impute missing values in “Annual.Income” using the median value
# Calculate the median of Annual.Income, excluding missing values
median_value <- median(df2$Annual.Income, na.rm = TRUE)
cat('Median for "Annual.Income":', median_value)
## Median for "Annual.Income": 1198701
# Replace missing values in Annual.Income with the median
df2$Annual.Income[is.na(df2$Annual.Income)] <- median_value
- Check the current distribution of “Annual.Income”
hist(log(df2$Annual.Income),
main = "Distribution of log(Annual.Income)",
xlab = "log(Annual.Income)",
col = "skyblue",
border = "white")
- Describe the current “Annual.Income” statistics
describe(df2["Annual.Income"])
## vars n mean sd median trimmed mad min
## Annual.Income 1 85791 1354427 995424.1 1198701 1238856 397568.8 76627
## max range skew kurtosis se
## Annual.Income 165557393 165480766 56.34 8695.7 3398.5
Mapping Years in current job
- Unique values before the mapping.
unique(df2$Years.in.current.job)
## [1] "8 years" "10+ years" "3 years" "5 years" "< 1 year" "2 years"
## [7] "4 years" "9 years" "7 years" "1 year" "6 years"
job_years_mapping <- list(
"< 1 year" = 1,
"1 year" = 1,
"2 years" = 2,
"3 years" = 3,
"4 years" = 4,
"5 years" = 5,
"6 years" = 6,
"7 years" = 7,
"8 years" = 8,
"9 years" = 9,
"10+ years" = 10
)
df2$`Years.in.current.job` <- sapply(df2$`Years.in.current.job`, function(x) job_years_mapping[[x]])
- Unique values after the mapping.
unique(df2$Years.in.current.job)
## [1] 8 10 3 5 1 2 4 9 7 6
Mapping Purpose
- Unique values before the mapping.
unique(df2$Purpose)
## [1] "Home Improvements" "Debt Consolidation" "Buy House"
## [4] "Business Loan" "Buy a Car" "other"
## [7] "major_purchase" "Take a Trip" "Other"
## [10] "small_business" "Medical Bills" "wedding"
## [13] "vacation" "Educational Expenses" "moving"
## [16] "renewable_energy"
- Since there are so many classes, let’s summarise Purpose by Loan.Status.
crosstab <- table(df2$Purpose, df2$Loan.Status)
crosstab
##
## Charged Off Fully Paid
## Business Loan 465 857
## Buy a Car 189 922
## Buy House 135 426
## Debt Consolidation 16939 50938
## Educational Expenses 14 67
## Home Improvements 1080 3866
## major_purchase 73 239
## Medical Bills 248 655
## moving 38 87
## other 1315 3588
## Other 552 2200
## renewable_energy 4 4
## small_business 112 132
## Take a Trip 104 358
## vacation 26 57
## wedding 21 80
Debt Consolidation should be treated as its own class as it represents a significant portion of the dataset.
- Map the purposes into aggregated categories
# Define the purpose mapping
purpose_mapping <- list(
"Debt Consolidation" = c("Debt Consolidation"),
"Personal Expenses" = c("Medical Bills", "wedding", "Educational Expenses", "moving", "vacation", "Take a Trip"),
"Home and Property" = c("Home Improvements", "Buy House", "renewable_energy"),
"Business and Entrepreneurship" = c("Business Loan", "small_business"),
"Transportation" = c("Buy a Car"),
"Other" = c("major_purchase", "Other", "other")
)
# Reverse the mapping
reversed_purpose_mapping <- unlist(lapply(names(purpose_mapping), function(category) {
setNames(rep(category, length(purpose_mapping[[category]])), purpose_mapping[[category]])
}))
# Map the reversed purpose mapping to the 'Purpose' column
df2$Purpose <- sapply(df2$Purpose, function(x) reversed_purpose_mapping[x])
crosstab <- table(df2$Purpose, df2$Loan.Status)
crosstab
##
## Charged Off Fully Paid
## Business and Entrepreneurship 577 989
## Debt Consolidation 16939 50938
## Home and Property 1219 4296
## Other 1940 6027
## Personal Expenses 451 1304
## Transportation 189 922
All the redundant classes have been categorised and are ready for further analyses.
Mapping Home Ownership
- Unique values before the mapping.
unique(df2$Home.Ownership)
## [1] "Home Mortgage" "Own Home" "Rent" "HaveMortgage"
crosstab <- table(df2$Home.Ownership, df2$Loan.Status)
crosstab
##
## Charged Off Fully Paid
## HaveMortgage 28 147
## Home Mortgage 9353 32396
## Own Home 1891 5614
## Rent 10043 26319
Since “HaveMortgage” has the same meaning as “Home Mortgage” regardless of loan status, these two classes can be grouped together to simplify the analysis without affecting the distribution. “Own Home” is not classified under home mortgage because the home is owned outright, with no property loan to pay in order to maintain ownership of the house.
ownership_mapping <- list(
"Home Mortgage" = c("Home Mortgage", "HaveMortgage"),
"Rent" = c("Rent"),
"Own Home" = c("Own Home")
)
reversed_ownership_mapping <- unlist(lapply(names(ownership_mapping), function(category) {
setNames(rep(category, length(ownership_mapping[[category]])), ownership_mapping[[category]])
}))
df2$Home.Ownership <- sapply(df2$Home.Ownership, function(x) reversed_ownership_mapping[x])
- Summary table of Home Ownership classes by Loan Status after the mapping.
crosstab <- table(df2$Home.Ownership, df2$Loan.Status)
crosstab
##
## Charged Off Fully Paid
## Home Mortgage 9381 32543
## Own Home 1891 5614
## Rent 10043 26319
“HaveMortgage” has been merged with “Home Mortgage”.
Drop Loan ID and Customer ID
# Columns to drop
drop <- c("Customer.ID", "Loan.ID")
# Drop the columns
df2 <- df2[, !(names(df2) %in% drop)]
The cleaned version of the dataframe.
head(df2)
## Loan.Status Current.Loan.Amount Term Credit.Score Annual.Income
## 1 Fully Paid 445412 Short Term 709 1167493
## 2 Fully Paid 262328 Short Term 725 1198701
## 3 Fully Paid 99999999 Short Term 741 2231892
## 4 Fully Paid 347666 Long Term 721 806949
## 5 Fully Paid 176220 Short Term 725 1198701
## 6 Charged Off 206602 Short Term 729 896857
## Years.in.current.job Home.Ownership Purpose Monthly.Debt
## 1 8 Home Mortgage Home and Property 5214.74
## 2 10 Home Mortgage Debt Consolidation 33295.98
## 3 8 Own Home Debt Consolidation 29200.53
## 4 3 Own Home Debt Consolidation 8741.90
## 5 5 Rent Debt Consolidation 20639.70
## 6 10 Home Mortgage Debt Consolidation 16367.74
## Years.of.Credit.History Number.of.Open.Accounts Number.of.Credit.Problems
## 1 17.2 6 1
## 2 21.1 35 0
## 3 14.9 18 1
## 4 12.0 9 0
## 5 6.1 15 0
## 6 17.3 6 0
## Current.Credit.Balance Maximum.Open.Credit Bankruptcies Tax.Liens
## 1 228190 416746 1 0
## 2 229976 850784 0 0
## 3 297996 750090 0 0
## 4 256329 386958 0 0
## 5 253460 427174 0 0
## 6 215308 272448 0 0
- Removed 10215 duplicate rows
- Dropped the “Months since last delinquent” column, as it has more than 50% missing values
- Dropped rows with missing values in “Years in current job”, “Bankruptcies”, “Tax Liens” or “Maximum Open Credit”, since their percentages of missing values are relatively low (less than 5%)
- Removed the extra trailing zero from “Credit Score” values above 850
- Imputed missing values in the “Credit Score” column with the median value
- Imputed missing values in the “Annual Income” column with the median value
- Mapped the “Years in current job” column from strings to integers
- Mapped the “Purpose” column from 16 categories into 6 categories
- Mapped the “Home Ownership” column from 4 categories into 3 categories
- Dropped “Loan ID” and “Customer ID”
Exploratory Data Analysis
- In this section, the relationship between the target variable and the independent features will be explored.
- There are two problems to be explored. For the classification problem, the Loan.Status column is selected as the target variable; for the regression problem, the Credit.Score column is selected as the target variable.
Categorical Features
cat <- cbind(Credit.Score = df2$Credit.Score,df2[, sapply(df2, function(x) !is.numeric(x))])
head(cat)
## Credit.Score Loan.Status Term Home.Ownership Purpose
## 1 709 Fully Paid Short Term Home Mortgage Home and Property
## 2 725 Fully Paid Short Term Home Mortgage Debt Consolidation
## 3 741 Fully Paid Short Term Own Home Debt Consolidation
## 4 721 Fully Paid Long Term Own Home Debt Consolidation
## 5 725 Fully Paid Short Term Rent Debt Consolidation
## 6 729 Charged Off Short Term Home Mortgage Debt Consolidation
- The target variable for each problem and the categorical features are selected for the categorical feature analyses.
Relationship between the target variable and Term
cat$Term <- factor(cat$Term)
cat$Loan.Status <- factor(cat$Loan.Status, levels = c("Fully Paid", "Charged Off"))
order <- cat %>%
count(Term) %>%
arrange(desc(n)) %>%
pull(Term)
cat$Term <- factor(cat$Term, levels = order)
# Loan Status by Term
p1 <- ggplot(cat, aes(x = Term, fill = Loan.Status)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("Fully Paid" = "#66C2A5", "Charged Off" = "#FC8D62")) +
labs(x = "Term", y = "Count", title = "Loan Status by Term") +
theme_minimal() +
theme(
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_x_discrete(limits = order)
# Credit Score by Term
p2 <- ggplot(cat, aes(x = Term, y = Credit.Score)) +
geom_boxplot(aes(fill = Term)) +
labs(x = "Term", y = "Credit Score", title = "Credit Score by Term") +
theme_minimal() +
theme(
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
)
p1 + p2
For short-term loans, the number of customers who fully paid their loan is higher than the number who were charged off. The median credit score of short-term customers is also high, despite a few low outliers. For long-term loans, the distribution between fully paid and charged off is more balanced, but fully paid still outnumbers charged off. The median credit score of long-term customers is still considered good, but lower than that of short-term customers.
Customers tend to pay short-term loans in full more often than long-term loans. They also tend to have higher credit scores, since most of them fully pay their loans compared to long-term customers.
Relationship between the target variable and Home.Ownership
crosstab <- table(cat$Home.Ownership, cat$Loan.Status)
crosstab
##
## Fully Paid Charged Off
## Home Mortgage 32543 9381
## Own Home 5614 1891
## Rent 26319 10043
order_home_ownership <- cat %>%
count(Home.Ownership) %>%
arrange(desc(n)) %>%
pull(Home.Ownership)
cat$Home.Ownership <- factor(cat$Home.Ownership, levels = order_home_ownership)
cat$Loan.Status <- factor(cat$Loan.Status, levels = c("Fully Paid", "Charged Off"))
# Loan Status by Home Ownership
p1 <- ggplot(cat, aes(x = Home.Ownership, fill = Loan.Status)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("Fully Paid" = "#66C2A5", "Charged Off" = "#FC8D62")) +
labs(x = "Home Ownership", y = "Count", title = "Loan Status by Home Ownership") +
theme_minimal() +
theme(
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Credit Score by Home Ownership
p2 <- ggplot(cat, aes(x = Home.Ownership, y = Credit.Score)) +
geom_boxplot(aes(fill = Home.Ownership)) +
labs(x = "Home Ownership", y = "Credit Score", title = "Credit Score by Home Ownership") +
theme_minimal() +
theme(
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
)
p1 + p2
For home mortgage, customers tend to fully pay their loan rather than being charged off. The median credit score of customers in this class is about the same as that of the own home class. The rent class has the highest number of charged-off customers and the lowest median credit score among all classes. The number of fully paid loans is higher than charged off for all classes, but customers in the rent class have more difficulty paying their loans than the other classes, which is reflected in a lower median credit score.
Relationship between the target variable and Purpose
order_purpose <- cat %>%
count(Purpose) %>%
arrange(n) %>%
pull(Purpose)
cat$Purpose <- factor(cat$Purpose, levels = order_purpose)
cat$Loan.Status <- factor(cat$Loan.Status, levels = c("Charged Off", "Fully Paid"))
# Loan Status by Purpose
p1 <- ggplot(cat, aes(y = Purpose, fill = Loan.Status)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("Fully Paid" = "#66C2A5", "Charged Off" = "#FC8D62")) +
labs(x = "Count", y = "Purpose", title = "Loan Status by Purpose") +
theme_minimal() +
theme(
plot.title = element_text(size = 30, hjust = 0.5),
axis.title.x = element_text(size = 22),
axis.title.y = element_text(size = 22),
axis.text = element_text(size = 20),
legend.title = element_text(size = 20),
legend.text = element_text(size = 20)
)
# Credit Score by Purpose
p2 <- ggplot(cat, aes(x = Purpose, y = Credit.Score)) +
geom_boxplot(aes(fill = Purpose)) +
labs(x = "Purpose", y = "Credit Score", title = "Credit Score by Purpose") +
theme_minimal() +
theme(
plot.title = element_text(size = 30, hjust = 0.5),
axis.title.x = element_text(size = 22),
axis.title.y = element_text(size = 22),
axis.text = element_text(size = 20),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.title = element_text(size = 20),
legend.text = element_text(size = 20)
)
p1 / p2
The majority of customers applied for loans for debt consolidation, and most of these loans are “Fully Paid.” However, a significant portion is also “Charged Off.” Loans are mostly fully paid across all classes, with debt consolidation standing out for its much higher loan count.
Customers who applied for transportation loans tend to have the highest credit scores, which indicates they are lower-risk customers. Customers who applied for debt consolidation, home and property, or other purposes have moderate credit scores. Customers who applied for personal expenses or business purposes have lower and more varied credit scores, which indicates they are higher-risk customers.
Chi-square Test
For the classification problem, the chi-square test is used to determine the significance of the variables.
# Encode Loan.Status
cat$Loan.Status <- as.numeric(factor(cat$Loan.Status))
perform_chi_square_test <- function(var1, var2) {
table <- table(var1, var2)
chi_square_result <- chisq.test(table)
result <- data.frame(
Variable1 = deparse(substitute(var1)),
Variable2 = deparse(substitute(var2)),
Chi_Square = chi_square_result$statistic,
DF = chi_square_result$parameter,
P_Value = chi_square_result$p.value
)
return(result)
}
chisquareresults <- bind_rows(
perform_chi_square_test(cat$Loan.Status, cat$Term),
perform_chi_square_test(cat$Loan.Status, cat$Purpose),
perform_chi_square_test(cat$Loan.Status, cat$Home.Ownership)
)
chisquareresults
## Variable1 Variable2 Chi_Square DF P_Value
## X-squared...1 cat$Loan.Status cat$Term 1776.9359 1 0.000000e+00
## X-squared...2 cat$Loan.Status cat$Purpose 181.6589 5 2.366598e-37
## X-squared...3 cat$Loan.Status cat$Home.Ownership 287.2476 2 4.216768e-63
All categorical features for the classification problem are significant since their p-value is less than 0.05.
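As a complement to the p-values, an effect-size measure can indicate how strong these associations actually are. The sketch below is an addition, not part of the original analysis: it computes Cramér’s V for each feature against Loan.Status, and the helper function cramers_v is illustrative.
# Cramér's V: sqrt(chi2 / (n * (k - 1))), where k is the smaller table dimension
cramers_v <- function(var1, var2) {
  tbl <- table(var1, var2)
  chi2 <- suppressWarnings(chisq.test(tbl)$statistic)
  k <- min(nrow(tbl), ncol(tbl))
  unname(sqrt(chi2 / (sum(tbl) * (k - 1))))
}
sapply(cat[, c("Term", "Purpose", "Home.Ownership")],
       function(v) cramers_v(cat$Loan.Status, v))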
Analysis of Variance (ANOVA)
For the regression problem, the ANOVA is used to determine the significance of the variables.
perform_anova <- function(data, category_column, value_column, alpha = 0.05) {
formula <- as.formula(paste(value_column, "~", category_column))
anova_result <- aov(formula, data = data)
summary_result <- summary(anova_result)[[1]]
summary_result <- summary_result[summary_result[[1]] != "Residuals", ]
# F-statistic and p-value
f_statistic <- summary_result[1, "F value"]
p_value <- summary_result[1, "Pr(>F)"]
# F-critical value
dfn <- length(unique(data[[category_column]])) - 1
dfd <- nrow(data) - length(unique(data[[category_column]]))
f_critical <- qf(1 - alpha, dfn, dfd)
# Significance
f_statistic_significance <- ifelse(f_statistic > f_critical, "Significant", "Not significant")
p_value_significance <- ifelse(p_value < alpha, "Significant", "Not significant")
result <- list(
Feature = category_column,
`F-statistic` = round(f_statistic, 6),
`F-critical value` = round(f_critical, 6),
`p-value` = round(p_value, 6),
`F-statistic Significance` = f_statistic_significance,
`p-value Significance` = p_value_significance
)
return(result)
}
anova_results <- bind_rows(
perform_anova(cat, "Term", "Credit.Score"),
perform_anova(cat, "Purpose", "Credit.Score"),
perform_anova(cat, "Home.Ownership", "Credit.Score")
)
anova_results
## # A tibble: 3 × 6
## Feature `F-statistic` `F-critical value` `p-value` F-statistic Signific…¹
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Term 19382. 3.84 0 Significant
## 2 Purpose 180. 2.21 0 Significant
## 3 Home.Owners… 79.6 3.00 0 Significant
## # ℹ abbreviated name: ¹`F-statistic Significance`
## # ℹ 1 more variable: `p-value Significance` <chr>
All categorical features for the regression problem are significant since their p-value is less than 0.05.
Numerical Features
- Numerical features are chosen for the analyses.
num <- df2 %>% select_if(is.numeric)
head(num)
## Current.Loan.Amount Credit.Score Annual.Income Years.in.current.job
## 1 445412 709 1167493 8
## 2 262328 725 1198701 10
## 3 99999999 741 2231892 8
## 4 347666 721 806949 3
## 5 176220 725 1198701 5
## 6 206602 729 896857 10
## Monthly.Debt Years.of.Credit.History Number.of.Open.Accounts
## 1 5214.74 17.2 6
## 2 33295.98 21.1 35
## 3 29200.53 14.9 18
## 4 8741.90 12.0 9
## 5 20639.70 6.1 15
## 6 16367.74 17.3 6
## Number.of.Credit.Problems Current.Credit.Balance Maximum.Open.Credit
## 1 1 228190 416746
## 2 0 229976 850784
## 3 1 297996 750090
## 4 0 256329 386958
## 5 0 253460 427174
## 6 0 215308 272448
## Bankruptcies Tax.Liens
## 1 1 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
Correlation Matrix
corrplot(cor(num),
method = "shade",
mar = c(0, 0, 11, 0),
title = "Correlation Matrix",
cex.main = 6,
tl.cex = 5.3,
cl.cex = 2,
addCoef.col = "black",
number.cex = 3,
diag = F)
- High positive correlation (0.75) between Bankruptcies and Number of Credit Problems.
- Moderately high correlation (0.58) between Tax Liens and Number of Credit Problems.
- Moderate correlation (0.41) between Monthly Debt and Annual Income.
- All other variables have low correlations.
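One common follow-up, shown below as a hedged sketch that was not applied in this project, is to let caret flag features above a correlation cutoff; with cutoff = 0.75 it would typically flag one of the Bankruptcies / Number.of.Credit.Problems pair.
# findCorrelation() suggests features to drop whose pairwise correlation exceeds the cutoff
high_corr <- findCorrelation(cor(num), cutoff = 0.75, names = TRUE)
high_corr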
Relationship between the target variable and Years in Current Job
df2$Years.in.current.job <- factor(df2$Years.in.current.job,
levels = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10'))
# Loan Status by Years in Current Job
p1 <- ggplot(df2, aes(x = Years.in.current.job, fill = `Loan.Status`)) +
geom_bar(position = "dodge") +
scale_x_discrete(labels = c('1 year', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years', '10+ years')) +
scale_fill_manual(values = c("Fully Paid" = "#66C2A5", "Charged Off" = "#FC8D62")) +
labs(x = "Years in Current Job", y = "Count", title = "Loan Status by Years in Current Job") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(size = 20, hjust = 0.5),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
legend.title = element_text(size = 16),
legend.text = element_text(size = 14))
# Credit Score by Years in Current Job
p2 <- ggplot(df2, aes(x = Years.in.current.job, y = Credit.Score)) +
geom_boxplot(aes(fill = Years.in.current.job)) +
scale_x_discrete(labels = c('1 year', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years', '10+ years')) +
labs(x = "Years in Current Job", y = "Credit Score", title = "Credit Score by Years in Current Job") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(size = 20, hjust = 0.5),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
legend.title = element_text(size = 16),
legend.text = element_text(size = 14))
p1 / p2
df2$Years.in.current.job <- as.numeric(df2$Years.in.current.job)
Customers with 10+ years in their current job have the highest count of Fully Paid loans compared to all other groups, which indicates that customers with longer job tenure tend to have better repayment records. Credit scores are stable across all classes.
Outliers Detection
- Find the features where zero value makes the majority proportions of the data.
threshold_percentage <- 50
for (col in colnames(num)) {
most_frequent_value <- names(sort(table(num[[col]]), decreasing = TRUE))[1]
if (most_frequent_value == 0) {
frequency <- max(table(num[[col]]))
percentage <- (frequency / length(num[[col]])) * 100
if (percentage > threshold_percentage) {
cat("Column:", col, "\n")
cat("Unique values:", paste(unique(num[[col]]), collapse = ", "), "\n")
cat("Most frequent value:", most_frequent_value, "\n")
cat("Frequency:", frequency, "\n")
cat("Percentage:", sprintf("%.2f", percentage), "%\n")
cat("\n")
}
}
}
## Column: Number.of.Credit.Problems
## Unique values: 1, 0, 5, 2, 4, 3, 7, 6, 11, 15, 8, 10, 9, 12
## Most frequent value: 0
## Frequency: 74430
## Percentage: 86.76 %
##
## Column: Bankruptcies
## Unique values: 1, 0, 2, 3, 5, 4, 7, 6
## Most frequent value: 0
## Frequency: 76864
## Percentage: 89.59 %
##
## Column: Tax.Liens
## Unique values: 0, 1, 4, 2, 3, 6, 5, 11, 15, 7, 10, 9
## Most frequent value: 0
## Frequency: 84231
## Percentage: 98.18 %
Since most of the data points in these columns are 0 and the remaining values are relatively small integers, an outlier removal process could exclude all values except 0. These features are therefore excluded from the outlier detection.
num_col <- colnames(num)
remove <- c('Number.of.Credit.Problems', 'Bankruptcies', 'Tax.Liens')
num_col <- setdiff(num_col, remove)
num <- num[, num_col]
head(num)
## Current.Loan.Amount Credit.Score Annual.Income Years.in.current.job
## 1 445412 709 1167493 8
## 2 262328 725 1198701 10
## 3 99999999 741 2231892 8
## 4 347666 721 806949 3
## 5 176220 725 1198701 5
## 6 206602 729 896857 10
## Monthly.Debt Years.of.Credit.History Number.of.Open.Accounts
## 1 5214.74 17.2 6
## 2 33295.98 21.1 35
## 3 29200.53 14.9 18
## 4 8741.90 12.0 9
## 5 20639.70 6.1 15
## 6 16367.74 17.3 6
## Current.Credit.Balance Maximum.Open.Credit
## 1 228190 416746
## 2 229976 850784
## 3 297996 750090
## 4 256329 386958
## 5 253460 427174
## 6 215308 272448
- Visualise outliers using boxplot.
num_long <- num %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
# Create the boxplot
ggplot(num_long, aes(x = Feature, y = Value, fill = Feature)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + # Rotate x-axis labels to 90 degrees
labs(title = "Outlier", x = "Feature", y = "Value") +
theme_minimal() +
theme(plot.title = element_text(size = 20, hjust = 0.5),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "none")
Encoding
encode <- function(data, columns) {
for (col in columns) {
# Print unique values before encoding
cat("Before encoding:", "\n")
print(unique(data[[col]]))
# Encoding the categorical column
data[[col]] <- as.integer(factor(data[[col]]))
# Print the encoded unique values
cat("After encoding:", "\n")
print(unique(data[[col]]))
cat("\n")
}
return(data)
}
cat_col <- c('Loan.Status', 'Term', 'Home.Ownership', 'Purpose')
- Encode the categorical features of df2.
df2 <- encode(df2, cat_col)
## Before encoding:
## [1] "Fully Paid" "Charged Off"
## After encoding:
## [1] 2 1
##
## Before encoding:
## [1] "Short Term" "Long Term"
## After encoding:
## [1] 2 1
##
## Before encoding:
## [1] "Home Mortgage" "Own Home" "Rent"
## After encoding:
## [1] 1 2 3
##
## Before encoding:
## [1] "Home and Property" "Debt Consolidation"
## [3] "Business and Entrepreneurship" "Transportation"
## [5] "Other" "Personal Expenses"
## After encoding:
## [1] 3 2 1 6 4 5
Variance Inflation Factor (VIF)
VIF is used to detect multicollinearity among independent variables.
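To make the definition concrete, the sketch below (an illustration added here, not from the original analysis) reproduces the VIF of one predictor by hand: VIF = 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors.
# Manual VIF for Annual.Income in the classification design matrix:
# regress it on the other predictors (everything except the response Loan.Status)
others <- df2[, setdiff(names(df2), "Loan.Status")]
r2 <- summary(lm(Annual.Income ~ ., data = others))$r.squared
1 / (1 - r2)  # should match the vif() value for Annual.Income computed below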
Classification Problem
# VIF for classification problem
lm_classification <- lm(as.formula(paste("Loan.Status ~", paste(setdiff(colnames(df2), "Loan.Status"), collapse = "+"))), data = df2)
vif_classification_results <- vif(lm_classification)
vif_classification <- data.frame(Feature = names(vif_classification_results),
VIF = vif_classification_results) %>%
arrange(desc(VIF))
ggplot(vif_classification, aes(x = reorder(Feature, VIF), y = VIF)) +
geom_bar(stat = "identity", fill = "#66C2A5") +
geom_text(aes(label = round(VIF, 3)), hjust = -0.3, size = 15) +
coord_flip() +
labs(
title = "Variance Inflation Factor (VIF) for Classification Problem Features",
x = "Feature",
y = "VIF"
) +
theme_minimal(base_size = 15) +
theme(plot.title = element_text(hjust = 0.5, size =55),
axis.title.x = element_text(size = 45),
axis.title.y = element_text(size = 45),
axis.text = element_text(size = 40))
The VIFs of all independent features for the classification problem are acceptable, since each is less than 10.
Regression Problem
# VIF for regression problem
lm_regression <- lm(as.formula(paste("Credit.Score ~", paste(setdiff(colnames(df2), "Credit.Score"), collapse = "+"))), data = df2)
vif_regression_results <- vif(lm_regression)
vif_regression <- data.frame(Feature = names(vif_regression_results),
VIF = vif_regression_results) %>%
arrange(desc(VIF))
ggplot(vif_regression, aes(x = reorder(Feature, VIF), y = VIF)) +
geom_bar(stat = "identity", fill = "#FC8D62") +
geom_text(aes(label = round(VIF, 3)), hjust = -0.3, size = 15) +
coord_flip() +
labs(
title = "Variance Inflation Factor (VIF) for Regression Problem Features",
x = "Feature",
y = "VIF"
) +
theme_minimal(base_size = 15) +
theme(plot.title = element_text(hjust = 0.5, size =55),
axis.title.x = element_text(size = 45),
axis.title.y = element_text(size = 45),
axis.text = element_text(size = 40))
The VIFs of all independent features for the regression problem are acceptable, since each is less than 10.
Modelling
Classification Problem
- For the classification problem, loan status is predicted using the Loan.Status target column: “Fully Paid” for eligible customers and “Charged Off” for ineligible customers.
Data Splitting
- The dataset is split into training set (0.8) and test set (0.2) by stratified splitting. The Loan.Status column is defined as the target variable.
set.seed(123)
classet <- createDataPartition(df2$Loan.Status, p=0.8, list=F)
clastrain <- df2[classet, ]
clastest <- df2[-classet, ]
prop.table(table(df2$Loan.Status))
##
## 1 2
## 0.2484526 0.7515474
prop.table(table(clastrain$Loan.Status))
##
## 1 2
## 0.2492678 0.7507322
prop.table(table(clastest$Loan.Status))
##
## 1 2
## 0.2451917 0.7548083
The class distribution of Loan.Status remains similar across all sets.
Resampling using SMOTE
- To address the imbalanced dataset, SMOTE (Synthetic Minority Oversampling Technique) was applied.
classmote <- SMOTE(X = clastrain[, -1],
target = clastrain$Loan.Status,
K = 5,
dup_size = 2)
clastrain1 <- data.frame(classmote$data)
names(clastrain1)[ncol(clastrain1)] <- "Loan.Status"
clastrain1$Loan.Status <- as.factor(clastrain1$Loan.Status)
clastrain1_x <- as.matrix(clastrain1[, -which(names(clastrain1) == "Loan.Status")])
clastrain1_y <- as.numeric(clastrain1$Loan.Status) - 1
clastest_x <- as.matrix(clastest[, -which(names(clastest) == "Loan.Status")])
clastest_y <- as.numeric(clastest$Loan.Status) - 1
prop.table(table(clastrain$Loan.Status))
##
## 1 2
## 0.2492678 0.7507322
prop.table(table(clastrain1$Loan.Status))
##
## 1 2
## 0.4990228 0.5009772
SMOTE synthesises two additional examples for each minority-class observation (dup_size = 2), giving the minority class an approximately equal share with the majority class.
Model Training
- Two models were chosen for the project: Random Forest and XGBoost.
- Random Forest is an ensemble machine learning method based on decision trees and it aggregates the results of multiple decision trees to improve accuracy and reduce overfitting.
rf_classifier <- randomForest(Loan.Status ~ ., data = clastrain1,
ntree = 100,
importance = TRUE)
- XGBoost (Extreme Gradient Boosting) is an optimized implementation of gradient boosting designed to improve performance and speed.
xgb_classifier <- xgboost(
data = clastrain1_x,
label = clastrain1_y,
nrounds = 100,
objective = 'binary:logistic'
)
## [1] train-logloss:0.596056
## [2] train-logloss:0.542586
## [3] train-logloss:0.509624
## [4] train-logloss:0.486869
## [5] train-logloss:0.468468
## [6] train-logloss:0.450323
## [7] train-logloss:0.440073
## [8] train-logloss:0.427220
## [9] train-logloss:0.419883
## [10] train-logloss:0.412799
## [11] train-logloss:0.409033
## [12] train-logloss:0.401297
## [13] train-logloss:0.398308
## [14] train-logloss:0.393384
## [15] train-logloss:0.391144
## [16] train-logloss:0.387098
## [17] train-logloss:0.383878
## [18] train-logloss:0.380269
## [19] train-logloss:0.378728
## [20] train-logloss:0.377309
## [21] train-logloss:0.375065
## [22] train-logloss:0.369018
## [23] train-logloss:0.361935
## [24] train-logloss:0.359263
## [25] train-logloss:0.358279
## [26] train-logloss:0.355269
## [27] train-logloss:0.353848
## [28] train-logloss:0.352044
## [29] train-logloss:0.351285
## [30] train-logloss:0.350476
## [31] train-logloss:0.349817
## [32] train-logloss:0.346267
## [33] train-logloss:0.345514
## [34] train-logloss:0.344774
## [35] train-logloss:0.342828
## [36] train-logloss:0.342304
## [37] train-logloss:0.341778
## [38] train-logloss:0.338915
## [39] train-logloss:0.338116
## [40] train-logloss:0.337389
## [41] train-logloss:0.337218
## [42] train-logloss:0.335160
## [43] train-logloss:0.334314
## [44] train-logloss:0.333590
## [45] train-logloss:0.333170
## [46] train-logloss:0.332305
## [47] train-logloss:0.331696
## [48] train-logloss:0.331025
## [49] train-logloss:0.330768
## [50] train-logloss:0.330379
## [51] train-logloss:0.329106
## [52] train-logloss:0.328854
## [53] train-logloss:0.328327
## [54] train-logloss:0.326051
## [55] train-logloss:0.325576
## [56] train-logloss:0.324394
## [57] train-logloss:0.323411
## [58] train-logloss:0.321972
## [59] train-logloss:0.321335
## [60] train-logloss:0.320433
## [61] train-logloss:0.320057
## [62] train-logloss:0.319495
## [63] train-logloss:0.319166
## [64] train-logloss:0.317997
## [65] train-logloss:0.317380
## [66] train-logloss:0.316968
## [67] train-logloss:0.316569
## [68] train-logloss:0.315874
## [69] train-logloss:0.315299
## [70] train-logloss:0.314597
## [71] train-logloss:0.313909
## [72] train-logloss:0.313707
## [73] train-logloss:0.313235
## [74] train-logloss:0.312517
## [75] train-logloss:0.311922
## [76] train-logloss:0.311067
## [77] train-logloss:0.310525
## [78] train-logloss:0.310114
## [79] train-logloss:0.309645
## [80] train-logloss:0.309160
## [81] train-logloss:0.308734
## [82] train-logloss:0.308627
## [83] train-logloss:0.308426
## [84] train-logloss:0.307926
## [85] train-logloss:0.307294
## [86] train-logloss:0.307170
## [87] train-logloss:0.307060
## [88] train-logloss:0.306673
## [89] train-logloss:0.306314
## [90] train-logloss:0.306202
## [91] train-logloss:0.305896
## [92] train-logloss:0.305518
## [93] train-logloss:0.304962
## [94] train-logloss:0.304567
## [95] train-logloss:0.304233
## [96] train-logloss:0.303952
## [97] train-logloss:0.303403
## [98] train-logloss:0.302593
## [99] train-logloss:0.302280
## [100] train-logloss:0.301792
Test the Model
rf_clas_pred_prob <- predict(rf_classifier, newdata = clastest, type = "prob")
rf_clas_pred <- ifelse(rf_clas_pred_prob[, 2] > 0.5, 1, 0)
xgb_clas_pred_prob <- predict(xgb_classifier, clastest_x)
xgb_clas_pred <- ifelse(xgb_clas_pred_prob > 0.5, 1, 0)
Regression Problem
Data Splitting
- The dataset is split into training set (0.8) and test set (0.2). The Credit.Score column is defined as the target variable.
set.seed(123)
regset <- createDataPartition(df2$Credit.Score, p=0.8, list=F)
regtrain <- df2[regset, ]
regtest <- df2[-regset, ]
regtrain_x <- as.matrix(regtrain[, -which(names(regtrain) == "Credit.Score")])
regtrain_y <- regtrain$Credit.Score
regtest_x <- as.matrix(regtest[, -which(names(regtest) == "Credit.Score")])
regtest_y <- regtest$Credit.Score
Model Training
Similar to the classification problem, Random Forest and XGBoost were also chosen for the regression problem.
Random Forest
rf_reg <- randomForest(Credit.Score ~ ., data=regtrain, ntree = 50)
- XGBoost
xgb_reg <- xgboost(
data = regtrain_x,
label = regtrain_y,
nrounds = 100
)
## [1] train-rmse:504.152902
## [2] train-rmse:353.258614
## [3] train-rmse:247.772815
## [4] train-rmse:174.113107
## [5] train-rmse:122.808234
## [6] train-rmse:87.233804
## [7] train-rmse:62.787166
## [8] train-rmse:46.273409
## [9] train-rmse:35.424976
## [10] train-rmse:28.599793
## [11] train-rmse:24.494535
## [12] train-rmse:22.172118
## [13] train-rmse:20.899739
## [14] train-rmse:20.196642
## [15] train-rmse:19.802898
## [16] train-rmse:19.535091
## [17] train-rmse:19.395052
## [18] train-rmse:19.305943
## [19] train-rmse:19.223774
## [20] train-rmse:19.146398
## [21] train-rmse:19.062564
## [22] train-rmse:19.009610
## [23] train-rmse:18.957081
## [24] train-rmse:18.904423
## [25] train-rmse:18.859871
## [26] train-rmse:18.806744
## [27] train-rmse:18.785699
## [28] train-rmse:18.760868
## [29] train-rmse:18.716777
## [30] train-rmse:18.696336
## [31] train-rmse:18.660398
## [32] train-rmse:18.616775
## [33] train-rmse:18.597696
## [34] train-rmse:18.551940
## [35] train-rmse:18.538307
## [36] train-rmse:18.487860
## [37] train-rmse:18.451868
## [38] train-rmse:18.444705
## [39] train-rmse:18.405782
## [40] train-rmse:18.382120
## [41] train-rmse:18.365880
## [42] train-rmse:18.344140
## [43] train-rmse:18.322081
## [44] train-rmse:18.289515
## [45] train-rmse:18.266673
## [46] train-rmse:18.259950
## [47] train-rmse:18.234204
## [48] train-rmse:18.209596
## [49] train-rmse:18.173723
## [50] train-rmse:18.144786
## [51] train-rmse:18.126122
## [52] train-rmse:18.093391
## [53] train-rmse:18.066887
## [54] train-rmse:18.045779
## [55] train-rmse:18.020931
## [56] train-rmse:17.995193
## [57] train-rmse:17.975349
## [58] train-rmse:17.942384
## [59] train-rmse:17.926082
## [60] train-rmse:17.906469
## [61] train-rmse:17.884972
## [62] train-rmse:17.864283
## [63] train-rmse:17.845357
## [64] train-rmse:17.830017
## [65] train-rmse:17.822264
## [66] train-rmse:17.806078
## [67] train-rmse:17.803143
## [68] train-rmse:17.777942
## [69] train-rmse:17.745179
## [70] train-rmse:17.740964
## [71] train-rmse:17.708373
## [72] train-rmse:17.695856
## [73] train-rmse:17.671558
## [74] train-rmse:17.633395
## [75] train-rmse:17.615272
## [76] train-rmse:17.598380
## [77] train-rmse:17.588348
## [78] train-rmse:17.566782
## [79] train-rmse:17.542397
## [80] train-rmse:17.528464
## [81] train-rmse:17.502755
## [82] train-rmse:17.491953
## [83] train-rmse:17.478933
## [84] train-rmse:17.455682
## [85] train-rmse:17.442524
## [86] train-rmse:17.435676
## [87] train-rmse:17.424583
## [88] train-rmse:17.407363
## [89] train-rmse:17.381464
## [90] train-rmse:17.365696
## [91] train-rmse:17.363838
## [92] train-rmse:17.360838
## [93] train-rmse:17.345311
## [94] train-rmse:17.321054
## [95] train-rmse:17.302149
## [96] train-rmse:17.270130
## [97] train-rmse:17.250148
## [98] train-rmse:17.225718
## [99] train-rmse:17.205127
## [100] train-rmse:17.182926
Test the Model
rf_reg_pred <- predict(rf_reg, newdata = regtest_x)
xgb_reg_pred <- predict(xgb_reg, regtest_x)
Evaluation
Classification Problem
# Random Forest performance
rf_conf_matrix <- confusionMatrix(factor(rf_clas_pred), factor(clastest_y))
rf_accuracy <- rf_conf_matrix$overall["Accuracy"]
rf_precision <- rf_conf_matrix$byClass["Pos Pred Value"]
rf_recall <- rf_conf_matrix$byClass["Sensitivity"]
rf_f1_score <- rf_conf_matrix$byClass["F1"]
roc_rf <- roc(clastest_y, rf_clas_pred_prob[, 2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
rf_auc_roc <- auc(roc_rf)
# XGBoost performance
xgb_conf_matrix <- confusionMatrix(factor(xgb_clas_pred), factor(clastest_y))
xgb_accuracy <- xgb_conf_matrix$overall["Accuracy"]
xgb_precision <- xgb_conf_matrix$byClass["Pos Pred Value"]
xgb_recall <- xgb_conf_matrix$byClass["Sensitivity"]
xgb_f1_score <- xgb_conf_matrix$byClass["F1"]
xgb_roc <- roc(clastest_y, xgb_clas_pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
xgb_auc_roc <- auc(xgb_roc)
classification_results <- data.frame(
Model = c("Random Forest", "XGBoost"),
Accuracy = c(rf_accuracy, xgb_accuracy),
Precision = c(rf_precision, xgb_precision),
Recall = c(rf_recall, xgb_recall),
F1_Score = c(rf_f1_score, xgb_f1_score),
AUC_ROC = c(rf_auc_roc, xgb_auc_roc)
)
classification_results
## Model Accuracy Precision Recall F1_Score AUC_ROC
## 1 Random Forest 0.7458911 0.4567063 0.1918232 0.2701707 0.7037483
## 2 XGBoost 0.7555076 0.5054054 0.1333492 0.2110213 0.6988339
The machine learning models were evaluated based on their performance metrics: Accuracy, Precision, Recall, F1-Score and AUC-ROC. XGBoost has slightly better accuracy and precision than Random Forest, while Random Forest performs slightly better in terms of recall, F1-Score and AUC-ROC.
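For transparency, the sketch below (an added illustration, not part of the original evaluation) recomputes the Random Forest metrics directly from the confusion-matrix counts. Note that confusionMatrix treats the first factor level, 0 (i.e., “Charged Off”), as the positive class, which explains the low recall values above.
# Recompute accuracy, precision, recall and F1 from raw counts
tab <- table(Predicted = rf_clas_pred, Actual = clastest_y)
TP <- tab["0", "0"]; FP <- tab["0", "1"]  # positive class = 0 ("Charged Off")
FN <- tab["1", "0"]; TN <- tab["1", "1"]
c(Accuracy  = (TP + TN) / sum(tab),
  Precision = TP / (TP + FP),
  Recall    = TP / (TP + FN),
  F1        = 2 * TP / (2 * TP + FP + FN))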
Regression Problem
evaluate_regression <- function(predictions, actuals) {
# Using caret
r_squared <- R2(pred = predictions, obs = actuals)
rmse <- RMSE(pred = predictions, obs = actuals)
# Compile results into a data frame
results <- data.frame(R_squared = r_squared, RMSE = rmse)
return(results)
}
rf_reg_results <- evaluate_regression(rf_reg_pred, regtest_y)
xgb_reg_results <- evaluate_regression(xgb_reg_pred, regtest_y)
regression_results <- data.frame(
Model = c("Random Forest Regressor", "XGBoost Regressor"),
R_Squared = c(rf_reg_results$R_squared, xgb_reg_results$R_squared),
RMSE = c(rf_reg_results$RMSE, xgb_reg_results$RMSE)
)
regression_results
## Model R_Squared RMSE
## 1 Random Forest Regressor 0.3828138 19.68817
## 2 XGBoost Regressor 0.3827512 19.68262
For the regression problem, the models were evaluated based on R-Squared and Root Mean Squared Error (RMSE). Both models have nearly identical R-Squared values, with a difference of only 0.0000626. The RMSE values are similarly close, with XGBoost having a slightly lower RMSE.
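As a cross-check, both metrics can be recomputed by hand (a minimal sketch added here); note that caret’s R2() defaults to the squared correlation between predictions and observations, which is what is reproduced below.
# Manual RMSE and R-squared for the XGBoost regressor
c(RMSE      = sqrt(mean((xgb_reg_pred - regtest_y)^2)),
  R_squared = cor(xgb_reg_pred, regtest_y)^2)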
Conclusion
For loan eligibility prediction, both Random Forest and XGBoost were effective. However, the choice between them depends on the problem’s priorities: If accuracy and precision are crucial, XGBoost is slightly better. If recall and overall balance are more critical, Random Forest performs better.
For predicting credit scores, both models demonstrated similar performance, explaining approximately 38% of the variance in the data. The minimal difference in RMSE suggests either model can be used effectively.
This project successfully demonstrated the application of machine learning for both classification and regression tasks in the context of loan eligibility and credit score prediction. The models provide actionable insights that can aid financial institutions in assessing loan risk and creditworthiness. Future improvements could focus on hyperparameter tuning and feature engineering for better performance.
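As a pointer for the suggested future work, a cross-validated grid search could look like the sketch below. This is hedged: the grid values and fold count are illustrative assumptions, and it was not run as part of the project.
# 5-fold cross-validated grid search over Random Forest's mtry via caret::train
ctrl <- trainControl(method = "cv", number = 5)
rf_tuned <- train(Loan.Status ~ ., data = clastrain1,
                  method = "rf",
                  ntree = 100,
                  tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)),
                  trControl = ctrl)
rf_tuned$bestTune  # mtry value with the best cross-validated accuracy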