Indonesia has the world’s 16th largest economy and is the world’s fourth most populous country with 274 million people, out of these 84% below the age of 54. With Internet penetration standing at 67% and financial inclusion remaining a key challenge, Indonesia offers a fertile ground for fintech innovation. Indonesia’s fast-growing fintech ecosystem has been largely driven by a proactive government that has introduced rules in areas including peer-to-peer (P2P) lending, digital payments, and most recently open banking, in hopes to encourage innovation and improve financial inclusion. This favorable landscape has attracted the attention of both local and international investors, many of which are betting big on the prospects of digital financial services in the country.
According to the report, the Indonesian fintech startups landscape is dominated by companies in the online lending space, which account for half of all fintech startups in the country. Online lending is followed by payments with 73 companies (23%), blockchain and cryptocurrencies with 26 companies (8%), and investment/personal finance with 24 companies (7%). Other segments represented include insurtech (15 companies), crowdfunding (9 companies), point-of-sale (POS) services (7 companies), and financial comparison (7 companies).
Our fintech notably from lending sector is huge, President Jokowi said for unbankable borrower have more than Rp1000 trillion prospect but our fintech only can absorp not more than 20% approximately Rp 194,1 trillion. It likes double edge sword, if we can not maintain our p2p lending performance it could be massive failure too. Consequently, We make this project to boost fintech in Indonesia especially from lending sector to reduce the risk factor from default borrower. Our project will do prediction for borrower profile and make prediction whether our borrower will be default or not.
In this project we will use dataset bondora p2p lending. This dataset contain information such as :
Age :The age of the borrower when signing the loan applicationAmount :Amount the borrower received on the Primary Market. This is the principal balance of your purchase from Secondary MarketAmountOfPreviousLoansBeforeLoan :Value of previous loansAppliedAmount :The amount borrower applied for originallyAuctionBidNumber :Unique bid number which is accompanied by Auction numberAuctionId :A unique number given to all auctionsCurrentDebtDaysPrimary :How long the loan has been in Principal DebtCurrentDebtDaysSecondary :How long the loan has been in Interest DebtDateOfBirth :The date of the borrower’s birthDebtOccuredOn :The date when Principal Debt occurredDebtOccuredOnForSecondary :The date when Interest Debt occurredDebtToIncome :Ratio of borrower’s monthly gross income that goes toward paying loansDefaultDate :The date when loan went into defaulted state and collection process was startedSource dataset : https://www.kaggle.com/sid321axn/bondora-peer-to-peer-lending-loan-data
Outpout from this project is dashboard that will show which are the indicator to accepting borrower and default rate from each borrower.
Business Impact : Our business impact has wide scope, not only for p2p lending company but it also can from lender side even borrower can get impact too. Let we see one by one, first from company view our dashboard can be useful for their business as tool for make good TKB90, TKB90 is a measurement of the operators to facilitate the settlement of lending obligation within 90 days since the due date. Second, we see from lender prespective, becoming a lender is not easy, you have to choose which one is worthy borrower that has low percentage of defaulting, with our dashboard lender can easily choose low risk borrower. Third, from borrower view this dashboard can be useful for them, so they can improve their credit score by knowing what are important indicator in calculation.
Import Library
library(dplyr)
library(tidyverse)
library(ggplot2)
library(caret)
library(e1071)
library(randomForest)
library(lubridate)
library(partykit)
library(plotly)
library(ggthemes)
options(scipen = 999)
Read Data
p2p <- read.csv("LoanData_Bondora.csv",sep=",",header=T,na.strings=c("",NA))
glimpse(p2p)
## Rows: 179,235
## Columns: 112
## $ ReportAsOfEOD <chr> "2021-07-20", "2021-07-20", "20…
## $ LoanId <chr> "66AE108B-532B-4BB3-BAB7-0019A4…
## $ LoanNumber <int> 483449, 378148, 451831, 349381,…
## $ ListedOnUTC <chr> "2016-03-23 16:07:19", "2015-06…
## $ BiddingStartedOn <chr> "2016-03-23 16:07:19", "2015-06…
## $ BidsPortfolioManager <int> 970, 1295, 2700, 1115, 305, 600…
## $ BidsApi <int> 1150, 0, 565, 0, 0, 0, 0, 0, 0,…
## $ BidsManual <dbl> 5, 1705, 5835, 385, 785, 175, 0…
## $ UserName <chr> "BO965519", "BOA9K172A", "BO797…
## $ NewCreditCustomer <chr> "False", "False", "True", "True…
## $ LoanApplicationStartedDate <chr> "2016-03-17 12:39:22", "2015-06…
## $ LoanDate <chr> "2016-03-23", "2015-06-25", "20…
## $ ContractEndDate <chr> "2020-06-26", NA, "2019-10-24",…
## $ FirstPaymentDate <chr> "2016-05-12", "2015-08-17", "20…
## $ MaturityDate_Original <chr> "2021-04-12", "2020-07-17", "20…
## $ MaturityDate_Last <chr> "2020-06-26", "2020-07-17", "20…
## $ ApplicationSignedHour <int> 17, 11, 22, 15, 20, 0, 15, 16, …
## $ ApplicationSignedWeekday <int> 4, 5, 3, 3, 3, 5, 2, 4, 2, 4, 3…
## $ VerificationType <int> 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 4…
## $ LanguageCode <int> 1, 1, 1, 6, 6, 6, 1, 6, 4, 1, 1…
## $ Age <int> 53, 50, 44, 42, 34, 31, 22, 47,…
## $ DateOfBirth <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Gender <int> 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0…
## $ Country <chr> "EE", "EE", "EE", "ES", "ES", "…
## $ AppliedAmount <dbl> 2125, 3000, 10630, 1500, 1595, …
## $ Amount <dbl> 2125, 3000, 9100, 1500, 1090, 7…
## $ Interest <dbl> 20.97, 17.12, 13.67, 40.40, 68.…
## $ LoanDuration <int> 60, 60, 60, 60, 48, 60, 60, 12,…
## $ MonthlyPayment <dbl> 62.05, 84.75, 268.57, 63.53, 10…
## $ County <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ City <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ UseOfLoan <int> 2, 3, 3, 2, 7, 7, 8, 6, 2, 0, 7…
## $ Education <int> 4, 5, 4, 2, 4, 4, 2, 3, 4, 4, 4…
## $ MaritalStatus <int> 2, 2, 4, 1, 4, 1, 3, 2, 1, 4, 3…
## $ NrOfDependants <chr> "0", "0", "1", "0", "2", "0", "…
## $ EmploymentStatus <int> 6, 5, 5, 3, 6, 3, 3, 6, 3, 3, 3…
## $ EmploymentDurationCurrentEmployer <chr> "MoreThan5Years", "MoreThan5Yea…
## $ EmploymentPosition <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ WorkExperience <chr> "15To25Years", "MoreThan25Years…
## $ OccupationArea <int> 1, 7, 8, 1, 1, 7, 9, 1, 11, 8, …
## $ HomeOwnershipType <int> 1, 1, 8, 2, 3, 4, 4, 1, 1, 6, 3…
## $ IncomeFromPrincipalEmployer <dbl> 0, 900, 600, 863, 0, 970, 745, …
## $ IncomeFromPension <dbl> 301, 0, 0, 0, 697, 0, 0, 1126, …
## $ IncomeFromFamilyAllowance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ IncomeFromSocialWelfare <dbl> 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ IncomeFromLeavePay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ IncomeFromChildSupport <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ IncomeOther <dbl> 0, 0, 600, 0, 0, 0, 0, 0, 0, 0,…
## $ IncomeTotal <dbl> 354, 900, 1200, 863, 697, 970, …
## $ ExistingLiabilities <int> 8, 4, 7, 1, 5, 5, 1, 2, 8, 9, 7…
## $ LiabilitiesTotal <dbl> 485.09, 736.45, 905.00, 350.00,…
## $ RefinanceLiabilities <int> 6, 0, 3, 0, 2, 2, 0, 0, 0, 3, 1…
## $ DebtToIncome <dbl> 26.29, 30.58, 26.71, 7.36, 36.0…
## $ FreeCash <dbl> 10.92, 78.80, 349.43, 449.47, 9…
## $ MonthlyPaymentDay <int> 12, 17, 20, 1, 1, 1, 15, 25, 20…
## $ ActiveScheduleFirstPaymentReached <chr> "True", "True", "True", "True",…
## $ PlannedPrincipalTillDate <dbl> 630.22, 1333.51, 3348.50, 573.3…
## $ PlannedInterestTillDate <dbl> 1251.98, 3000.00, 9100.00, 1500…
## $ LastPaymentOn <chr> "2021-06-16", "2019-06-19", "20…
## $ CurrentDebtDaysPrimary <int> 552, 1918, 1368, 2057, 1995, 20…
## $ DebtOccuredOn <chr> "2020-01-14", "2016-04-18", "20…
## $ CurrentDebtDaysSecondary <int> 630, 1979, 1428, 2057, 1995, 20…
## $ DebtOccuredOnForSecondary <chr> "2019-10-28", "2016-02-17", "20…
## $ ExpectedLoss <dbl> 0.06851187, 0.03079913, 0.02317…
## $ LossGivenDefault <dbl> 0.58, 0.65, 0.58, 0.90, 0.75, 0…
## $ ExpectedReturn <dbl> 0.1411449, 0.1404356, 0.1134841…
## $ ProbabilityOfDefault <dbl> 0.12221594, 0.03644867, 0.04134…
## $ DefaultDate <chr> "2020-01-14", "2016-06-02", "20…
## $ PrincipalOverdueBySchedule <dbl> 1155.84, 2436.41, 0.00, 1035.27…
## $ PlannedPrincipalPostDefault <dbl> 1251.98, 2658.82, 6456.37, 1406…
## $ PlannedInterestPostDefault <dbl> 77.68, 1078.96, 1537.37, 1520.1…
## $ EAD1 <dbl> 1251.98, 2730.84, 6723.01, 1434…
## $ EAD2 <dbl> 64.07, 2370.77, 5014.54, 1078.7…
## $ PrincipalRecovery <dbl> 96.14, 294.43, 4160.01, 399.41,…
## $ InterestRecovery <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0, NA,…
## $ RecoveryStage <int> 2, 2, 2, 1, 1, 2, 2, NA, NA, NA…
## $ StageActiveSince <chr> "2020-03-03 09:27:48.493000000"…
## $ ModelVersion <int> 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1…
## $ Rating <chr> "C", "B", "A", "F", "HR", "HR",…
## $ EL_V0 <dbl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Rating_V0 <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ EL_V1 <dbl> NA, 0.03079913, NA, 0.22080971,…
## $ Rating_V1 <chr> NA, "B", NA, "F", NA, "HR", NA,…
## $ Rating_V2 <chr> "C", "B", "A", "HR", "HR", "HR"…
## $ Status <chr> "Late", "Late", "Repaid", "Late…
## $ Restructured <chr> "False", "False", "False", "Fal…
## $ ActiveLateCategory <chr> "180+", "180+", "180+", "180+",…
## $ WorseLateCategory <chr> "180+", "180+", "180+", "180+",…
## $ CreditScoreEsMicroL <chr> NA, NA, NA, "M3", "M5", "M5", N…
## $ CreditScoreEsEquifaxRisk <chr> NA, NA, NA, "B", "C", "C", NA, …
## $ CreditScoreFiAsiakasTietoRiskGrade <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CreditScoreEeMini <int> 1000, 1000, 1000, NA, NA, NA, 8…
## $ PrincipalPaymentsMade <dbl> 969.16, 563.59, 6537.00, 464.73…
## $ InterestAndPenaltyPaymentsMade <dbl> 1187.91, 360.07, 1708.47, 355.9…
## $ PrincipalWriteOffs <dbl> 0.00, 0.00, 2303.33, 0.00, 0.00…
## $ InterestAndPenaltyWriteOffs <dbl> 0.00, 0.00, 0.88, 0.00, 0.00, 0…
## $ PrincipalBalance <dbl> 1155.84, 2436.41, 0.00, 1035.27…
## $ InterestAndPenaltyBalance <dbl> 433.60, 2291.82, 0.00, 2833.83,…
## $ NoOfPreviousLoansBeforeLoan <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2…
## $ AmountOfPreviousLoansBeforeLoan <dbl> 500, 1800, 0, 0, 0, 0, 0, 0, 0,…
## $ PreviousRepaymentsBeforeLoan <dbl> 590.95, 445.26, 0.00, 0.00, 0.0…
## $ PreviousEarlyRepaymentsBefoleLoan <dbl> 0, 3000, 0, 1500, 0, 0, 0, 1500…
## $ PreviousEarlyRepaymentsCountBeforeLoan <int> 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1…
## $ GracePeriodStart <chr> "2019-10-28", NA, NA, NA, NA, N…
## $ GracePeriodEnd <chr> "2020-01-27", NA, NA, NA, NA, N…
## $ NextPaymentDate <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ NextPaymentNr <int> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ NrOfScheduledPayments <int> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ReScheduledOn <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PrincipalDebtServicingCost <dbl> 0.00, 0.00, 259.67, 0.00, 0.00,…
## $ InterestAndPenaltyDebtServicingCost <dbl> 51.73, 47.08, 1659.60, 215.05, …
## $ ActiveLateLastPaymentCategory <chr> "31-60", "180+", "180+", "180+"…
Check dataset dimension
dim(p2p)
## [1] 179235 112
Check missing value mean
colMeans(is.na(p2p))*100
## ReportAsOfEOD LoanId
## 0.000000000 0.000000000
## LoanNumber ListedOnUTC
## 0.000000000 0.000000000
## BiddingStartedOn BidsPortfolioManager
## 0.000000000 0.000000000
## BidsApi BidsManual
## 0.000000000 0.000000000
## UserName NewCreditCustomer
## 0.000000000 0.000000000
## LoanApplicationStartedDate LoanDate
## 0.000000000 0.000000000
## ContractEndDate FirstPaymentDate
## 42.535777052 0.000000000
## MaturityDate_Original MaturityDate_Last
## 0.000000000 0.000000000
## ApplicationSignedHour ApplicationSignedWeekday
## 0.000000000 0.000000000
## VerificationType LanguageCode
## 0.027896337 0.000000000
## Age DateOfBirth
## 0.000000000 100.000000000
## Gender Country
## 0.025106703 0.000000000
## AppliedAmount Amount
## 0.000000000 0.000000000
## Interest LoanDuration
## 0.000000000 0.000000000
## MonthlyPayment County
## 3.732529919 100.000000000
## City UseOfLoan
## 100.000000000 0.000000000
## Education MaritalStatus
## 0.027896337 0.027896337
## NrOfDependants EmploymentStatus
## 80.137807906 0.112701202
## EmploymentDurationCurrentEmployer EmploymentPosition
## 0.896030351 100.000000000
## WorkExperience OccupationArea
## 79.623399448 0.050771334
## HomeOwnershipType IncomeFromPrincipalEmployer
## 0.924484615 0.000000000
## IncomeFromPension IncomeFromFamilyAllowance
## 0.000000000 0.000000000
## IncomeFromSocialWelfare IncomeFromLeavePay
## 0.000000000 0.000000000
## IncomeFromChildSupport IncomeOther
## 0.000000000 0.000000000
## IncomeTotal ExistingLiabilities
## 0.000000000 0.000000000
## LiabilitiesTotal RefinanceLiabilities
## 0.000000000 0.000000000
## DebtToIncome FreeCash
## 0.027896337 0.027896337
## MonthlyPaymentDay ActiveScheduleFirstPaymentReached
## 0.000000000 0.000000000
## PlannedPrincipalTillDate PlannedInterestTillDate
## 69.613077803 1.101347393
## LastPaymentOn CurrentDebtDaysPrimary
## 3.752057355 57.472591849
## DebtOccuredOn CurrentDebtDaysSecondary
## 57.472591849 54.100482607
## DebtOccuredOnForSecondary ExpectedLoss
## 54.100482607 1.472368678
## LossGivenDefault ExpectedReturn
## 1.472368678 1.472368678
## ProbabilityOfDefault DefaultDate
## 1.472368678 60.155103635
## PrincipalOverdueBySchedule PlannedPrincipalPostDefault
## 8.281306664 60.155661562
## PlannedInterestPostDefault EAD1
## 60.155661562 60.156219488
## EAD2 PrincipalRecovery
## 60.156219488 60.155661562
## InterestRecovery RecoveryStage
## 60.155661562 37.545680252
## StageActiveSince ModelVersion
## 33.747872904 1.472368678
## Rating EL_V0
## 1.524813792 97.450832706
## Rating_V0 EL_V1
## 97.450832706 92.789354758
## Rating_V1 Rating_V2
## 92.789354758 85.972047870
## Status Restructured
## 0.000000000 0.000000000
## ActiveLateCategory WorseLateCategory
## 58.164421012 29.195748598
## CreditScoreEsMicroL CreditScoreEsEquifaxRisk
## 16.907412057 93.182693112
## CreditScoreFiAsiakasTietoRiskGrade CreditScoreEeMini
## 76.663040143 41.363572963
## PrincipalPaymentsMade InterestAndPenaltyPaymentsMade
## 0.000000000 0.000000000
## PrincipalWriteOffs InterestAndPenaltyWriteOffs
## 57.513878428 57.513878428
## PrincipalBalance InterestAndPenaltyBalance
## 0.000000000 2.919072726
## NoOfPreviousLoansBeforeLoan AmountOfPreviousLoansBeforeLoan
## 0.005021341 0.005021341
## PreviousRepaymentsBeforeLoan PreviousEarlyRepaymentsBefoleLoan
## 34.806817865 67.146483667
## PreviousEarlyRepaymentsCountBeforeLoan GracePeriodStart
## 0.005021341 76.631238318
## GracePeriodEnd NextPaymentDate
## 76.631238318 63.388289118
## NextPaymentNr NrOfScheduledPayments
## 37.349847965 37.349847965
## ReScheduledOn PrincipalDebtServicingCost
## 64.826066337 57.513878428
## InterestAndPenaltyDebtServicingCost ActiveLateLastPaymentCategory
## 57.513878428 54.100482607
Based on our dataset we have many missing value, we have to drop some column that contain more than 20% missing value.
Remove columns with more than 20% NA
p2p_clean <- p2p[, which(colMeans(!is.na(p2p)) > 0.8)]
colMeans(is.na(p2p_clean))*100
## ReportAsOfEOD LoanId
## 0.000000000 0.000000000
## LoanNumber ListedOnUTC
## 0.000000000 0.000000000
## BiddingStartedOn BidsPortfolioManager
## 0.000000000 0.000000000
## BidsApi BidsManual
## 0.000000000 0.000000000
## UserName NewCreditCustomer
## 0.000000000 0.000000000
## LoanApplicationStartedDate LoanDate
## 0.000000000 0.000000000
## FirstPaymentDate MaturityDate_Original
## 0.000000000 0.000000000
## MaturityDate_Last ApplicationSignedHour
## 0.000000000 0.000000000
## ApplicationSignedWeekday VerificationType
## 0.000000000 0.027896337
## LanguageCode Age
## 0.000000000 0.000000000
## Gender Country
## 0.025106703 0.000000000
## AppliedAmount Amount
## 0.000000000 0.000000000
## Interest LoanDuration
## 0.000000000 0.000000000
## MonthlyPayment UseOfLoan
## 3.732529919 0.000000000
## Education MaritalStatus
## 0.027896337 0.027896337
## EmploymentStatus EmploymentDurationCurrentEmployer
## 0.112701202 0.896030351
## OccupationArea HomeOwnershipType
## 0.050771334 0.924484615
## IncomeFromPrincipalEmployer IncomeFromPension
## 0.000000000 0.000000000
## IncomeFromFamilyAllowance IncomeFromSocialWelfare
## 0.000000000 0.000000000
## IncomeFromLeavePay IncomeFromChildSupport
## 0.000000000 0.000000000
## IncomeOther IncomeTotal
## 0.000000000 0.000000000
## ExistingLiabilities LiabilitiesTotal
## 0.000000000 0.000000000
## RefinanceLiabilities DebtToIncome
## 0.000000000 0.027896337
## FreeCash MonthlyPaymentDay
## 0.027896337 0.000000000
## ActiveScheduleFirstPaymentReached PlannedInterestTillDate
## 0.000000000 1.101347393
## LastPaymentOn ExpectedLoss
## 3.752057355 1.472368678
## LossGivenDefault ExpectedReturn
## 1.472368678 1.472368678
## ProbabilityOfDefault PrincipalOverdueBySchedule
## 1.472368678 8.281306664
## ModelVersion Rating
## 1.472368678 1.524813792
## Status Restructured
## 0.000000000 0.000000000
## CreditScoreEsMicroL PrincipalPaymentsMade
## 16.907412057 0.000000000
## InterestAndPenaltyPaymentsMade PrincipalBalance
## 0.000000000 0.000000000
## InterestAndPenaltyBalance NoOfPreviousLoansBeforeLoan
## 2.919072726 0.005021341
## AmountOfPreviousLoansBeforeLoan PreviousEarlyRepaymentsCountBeforeLoan
## 0.005021341 0.005021341
Change dataset type and drop unuseful column
p2p_clean <- p2p_clean %>%
na.omit() %>%
mutate_if(is.character, as.factor) %>%
select(-c(UserName, LoanId, LoanNumber)) %>%
mutate(Gender = case_when(Gender == 0 ~ "Male",
Gender == 1 ~ "Female",
Gender == 2 ~ "Undefined"),
MaritalStatus = case_when(
MaritalStatus == 1 ~ "Married",
MaritalStatus == 2 ~ "Cohabitant",
MaritalStatus == 3 ~ "Single",
MaritalStatus == 4 ~ "Divorced",
MaritalStatus == 5 ~ "Widow",
MaritalStatus == -1 ~ "Not specified",
MaritalStatus == 0 ~ "Not specified"),
EmploymentStatus = case_when(
EmploymentStatus == 1 ~ "Unemployed",
EmploymentStatus == 2 ~ "Partially employed",
EmploymentStatus == 3 ~ "Fully employed",
EmploymentStatus == 4 ~ "Self-employed",
EmploymentStatus == 5 ~ "Entreprenuer",
EmploymentStatus == 6 ~ "Retiree",
EmploymentStatus == -1 ~ "Not specified",
EmploymentStatus == 0 ~ "Not specified"),
Education = case_when(
Education == 1 ~ "Primary education",
Education == 2 ~ "Basic education",
Education == 3 ~ "Vocational education",
Education == 4 ~ "Secondary education",
Education == 5 ~ "Higher education",
Education == -1 ~ "Unknown",
Education == 0 ~ "Unkown"),
VerificationType = case_when(
VerificationType == 0 ~ "Not set",
VerificationType == 1 ~ "Income unverified",
VerificationType == 2 ~ "Income unverified, cross-referenced by phone",
VerificationType == 3 ~ "Income verified",
VerificationType == 4 ~ "Income and expenses verified"),
HomeOwnershipType = case_when(
HomeOwnershipType == 0 ~ "Homeless",
HomeOwnershipType == 1 ~ "Owner",
HomeOwnershipType == 2 ~ "Living with parents",
HomeOwnershipType == 3 ~ "Tenant, pre-furnished property",
HomeOwnershipType == 4 ~ "Tenant, unfurnished property",
HomeOwnershipType == 5 ~ "Council house",
HomeOwnershipType == 6 ~ "Joint tenant",
HomeOwnershipType == 7 ~ "Joint ownership",
HomeOwnershipType == 8 ~ "Mortgage",
HomeOwnershipType == 9 ~ "Owner with encumbrance",
HomeOwnershipType == 10 ~ "Other"),
Restructured = case_when(
Restructured == "True" ~ "Yes",
Restructured == "False" ~ "No"),
NewCreditCustomer = case_when(
NewCreditCustomer == "True" ~ "New Credit Customer",
NewCreditCustomer == "False" ~ "Existing Credit Customer"),
LanguageCode = case_when(
LanguageCode == 1 ~ "Estonian",
LanguageCode == 2 ~ "English",
LanguageCode == 3 ~ "Russian",
LanguageCode == 4 ~ "Finnish",
LanguageCode == 5 ~ "German",
LanguageCode == 6 ~ "Spanish",
LanguageCode == 9 ~ "Slovakian",
LanguageCode == 7 ~ "Others",
LanguageCode == 10 ~ "Others",
LanguageCode == 13 ~ "Others",
LanguageCode == 15 ~ "Others",
LanguageCode == 21 ~ "Others",
LanguageCode == 22 ~ "Others"),
UseOfLoan = case_when(
UseOfLoan == -1 ~ "Not Set",
UseOfLoan == 0 ~ "Loan Consolidation",
UseOfLoan == 1 ~ "Real Estate",
UseOfLoan == 2 ~ "Home Improvement",
UseOfLoan == 3 ~ "Business",
UseOfLoan == 4 ~ "Education",
UseOfLoan == 5 ~ "Travel",
UseOfLoan == 6 ~ "Vehicle",
UseOfLoan == 7 ~ "Other",
UseOfLoan == 8 ~ "Health",
UseOfLoan == 101 ~ "Working capital financing",
UseOfLoan == 102 ~ "Purchase of machinery equipment",
UseOfLoan == 104 ~ "Accounts receivable financing",
UseOfLoan == 106 ~ "Construction finance",
UseOfLoan == 107 ~ "Acquisition of stocks",
UseOfLoan == 108 ~ "Acquisition of real estate",
UseOfLoan == 110 ~ "Other Business"),
OccupationArea = case_when(
OccupationArea == 1 ~ "Other",
OccupationArea == 2 ~ "Mining",
OccupationArea == 3 ~ "Processing",
OccupationArea == 4 ~ "Energy",
OccupationArea == 5 ~ "Utilities",
OccupationArea == 6 ~ "Construction",
OccupationArea == 7 ~ "Retail and Wholesale",
OccupationArea == 8 ~ "Transport and Warehousing",
OccupationArea == 9 ~ "Hospitality and Catering",
OccupationArea == 10 ~ "Info and Telecom",
OccupationArea == 11 ~ "Finance and Insurance",
OccupationArea == 12 ~ "Real Estate",
OccupationArea == 13 ~ "Research",
OccupationArea == 14 ~ "Administrative",
OccupationArea == 15 ~ "Civil Service and Military",
OccupationArea == 16 ~ "Education",
OccupationArea == 17 ~ "Healthcare and Social Help",
OccupationArea == 18 ~ "Art and Entertainment",
OccupationArea == 19 ~ "Argiculture Forestry and Fishing",
OccupationArea == -1 ~ "Not specified",
OccupationArea == 0 ~ "Not specified"),
Status = case_when(
Status == "Repaid" ~ "Repaid",
Status == "Late" ~ "Default"))%>%
mutate(ReportAsOfEOD = ymd(ReportAsOfEOD),
LoanApplicationStartedDate = ymd_hms(LoanApplicationStartedDate),
BiddingStartedOn = ymd_hms(BiddingStartedOn),
ListedOnUTC = ymd_hms(ListedOnUTC),
LoanDate = ymd(LoanDate),
FirstPaymentDate = ymd(FirstPaymentDate),
MaturityDate_Original = ymd(MaturityDate_Original),
MaturityDate_Last = ymd(MaturityDate_Last),
LastPaymentOn = ymd(LastPaymentOn),
Gender = as.factor(Gender),
MaritalStatus = as.factor(MaritalStatus),
EmploymentStatus = as.factor(EmploymentStatus),
Education = as.factor(Education),
VerificationType = as.factor(VerificationType),
HomeOwnershipType = as.factor(HomeOwnershipType),
Restructured = as.factor(Restructured),
NewCreditCustomer = as.factor(NewCreditCustomer),
LanguageCode = as.factor(LanguageCode),
UseOfLoan = as.factor(UseOfLoan),
OccupationArea = as.factor(OccupationArea),
Status = as.factor(Status))%>%
filter(LoanDate >= "2015-01-01",
Status != "Current",
Gender != "Undefined",
Education != "Unknown",
NewCreditCustomer != "Existing Credit Customer") %>%
droplevels("Current", "Undefined", "Unknown")
Create Near Zero Var Function
n0_var <- nearZeroVar(p2p_clean)
p2p_novar <- p2p_clean[, -n0_var]
dim(p2p_novar)
## [1] 40900 45
colSums(is.na(p2p_novar))
## ListedOnUTC BiddingStartedOn
## 0 0
## BidsPortfolioManager BidsApi
## 0 0
## BidsManual LoanApplicationStartedDate
## 0 0
## LoanDate FirstPaymentDate
## 0 0
## MaturityDate_Original MaturityDate_Last
## 0 0
## ApplicationSignedHour ApplicationSignedWeekday
## 0 0
## VerificationType LanguageCode
## 0 0
## Age Gender
## 0 0
## Country AppliedAmount
## 0 0
## Amount Interest
## 0 0
## LoanDuration MonthlyPayment
## 0 0
## Education EmploymentDurationCurrentEmployer
## 0 0
## HomeOwnershipType IncomeTotal
## 0 0
## ExistingLiabilities LiabilitiesTotal
## 0 0
## MonthlyPaymentDay PlannedInterestTillDate
## 0 0
## LastPaymentOn ExpectedLoss
## 0 0
## LossGivenDefault ExpectedReturn
## 0 0
## ProbabilityOfDefault PrincipalOverdueBySchedule
## 0 0
## ModelVersion Rating
## 0 0
## Status Restructured
## 0 0
## PrincipalPaymentsMade InterestAndPenaltyPaymentsMade
## 0 0
## PrincipalBalance InterestAndPenaltyBalance
## 0 0
## NoOfPreviousLoansBeforeLoan
## 0
p2p_novar
Drop unuseful columns
p2p_novar_clean2 <- p2p_novar %>%
select(-c(ListedOnUTC, BiddingStartedOn, BidsApi, LoanApplicationStartedDate, LoanDate, FirstPaymentDate, MaturityDate_Original, ApplicationSignedHour,ApplicationSignedWeekday, MaturityDate_Last, LastPaymentOn, ModelVersion, PrincipalBalance, InterestAndPenaltyBalance, PrincipalOverdueBySchedule, BidsPortfolioManager, BidsManual, LanguageCode, Country, ProbabilityOfDefault, NoOfPreviousLoansBeforeLoan, ExpectedReturn, ExpectedLoss, LossGivenDefault, PrincipalPaymentsMade))
p2p_novar_clean2
Create summary dataset
summary(p2p_novar_clean2)
## VerificationType Age Gender
## Income and expenses verified:28445 Min. :18.00 Female:13328
## Income unverified :11695 1st Qu.:30.00 Male :27572
## Income verified : 760 Median :39.00
## Mean :40.77
## 3rd Qu.:51.00
## Max. :70.00
##
## AppliedAmount Amount Interest LoanDuration
## Min. : 500 Min. : 112 Min. : 7.26 Min. : 3.00
## 1st Qu.: 1275 1st Qu.: 1100 1st Qu.: 29.66 1st Qu.:36.00
## Median : 2550 Median : 2230 Median : 36.95 Median :60.00
## Mean : 3082 Mean : 2917 Mean : 42.09 Mean :49.26
## 3rd Qu.: 4250 3rd Qu.: 4250 3rd Qu.: 57.09 3rd Qu.:60.00
## Max. :10632 Max. :10632 Max. :264.31 Max. :60.00
##
## MonthlyPayment Education
## Min. : 0.00 Basic education : 499
## 1st Qu.: 64.81 Higher education : 9595
## Median : 118.92 Primary education : 5896
## Mean : 142.57 Secondary education :12331
## 3rd Qu.: 184.45 Vocational education:12579
## Max. :2058.32
##
## EmploymentDurationCurrentEmployer HomeOwnershipType
## MoreThan5Years:14591 Owner :13578
## UpTo5Years :10073 Tenant, pre-furnished property:12373
## UpTo1Year : 8169 Living with parents : 5861
## Retiree : 3949 Mortgage : 4324
## Other : 2812 Other : 3701
## UpTo2Years : 572 Tenant, unfurnished property : 443
## (Other) : 734 (Other) : 620
## IncomeTotal ExistingLiabilities LiabilitiesTotal MonthlyPaymentDay
## Min. : 0.2 Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1000.0 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 4.00
## Median : 1430.0 Median : 1.000 Median : 74.72 Median :10.00
## Mean : 1915.7 Mean : 1.224 Mean : 267.15 Mean :10.45
## 3rd Qu.: 2100.0 3rd Qu.: 2.000 3rd Qu.: 350.00 3rd Qu.:16.00
## Max. :1012019.0 Max. :32.000 Max. :250100.00 Max. :27.00
##
## PlannedInterestTillDate Rating Status Restructured
## Min. : 0.0 E :11805 Default:25983 No :30677
## 1st Qu.: 345.0 F :10116 Repaid :14917 Yes:10223
## Median : 705.1 D : 7816
## Mean : 1373.1 C : 4559
## 3rd Qu.: 1957.7 HR : 3765
## Max. :10632.0 B : 1529
## (Other): 1310
## InterestAndPenaltyPaymentsMade
## Min. : 0.00
## 1st Qu.: 82.25
## Median : 296.11
## Mean : 612.99
## 3rd Qu.: 779.02
## Max. :12902.48
##
We have to split our data into 80% train data and 20% test data.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(120)
index <- sample(nrow(p2p_novar_clean2), nrow(p2p_novar_clean2)*0.8)
p2p_train2 <- p2p_novar_clean2[index,]
p2p_test2 <- p2p_novar_clean2[-index,]
Check head train dataset
head(p2p_train2)
prop.table(table(p2p_novar_clean2$Status))
##
## Default Repaid
## 0.6352812 0.3647188
Let’s see status loan application.
plot5<- ggplot(p2p_novar) +
aes(x = Status, fill = Status) +
geom_bar() +
scale_fill_manual(values = c("#0073C2FF", "#EFC000FF"))+
labs(title = "Loan Status") +
theme_economist()
ggplotly(plot5)
We can see from graphic above our repaid status is default. That means this p2p lending screening is not good enough to select worthy borrower.
We can make loan amount graphic to look at how good our funding distribution.
plot0 <- ggplot(p2p_novar) +
aes(x = Amount, colour = Amount) +
geom_histogram(bins = 20L, fill = "purple") +
scale_color_distiller(palette = "PuBu",
direction = -1) +
labs(title = "Funding Distribution") +
theme_economist()
ggplotly(plot0)
From graphic above we can see most of our borrower get amount under $3000.
See our borrower age distribution
plot1 <- ggplot(p2p_novar) +
aes(x = Age, colour = Age) +
geom_histogram(bins = 30L, fill = "Purple") +
scale_color_distiller(palette = "PuBu",
direction = -1) +
labs(title = "Age Borrower Distribution") +
theme_economist()
ggplotly(plot1)
Most of our borrower have age between 20 - 40 years old. Peer to peer lending company usually like this age because they have power to earn money from their job.
We make boxplot from our status vs age.
plot2 <- ggplot(p2p_novar) +
aes(x = Status, y = Age, fill = Status) +
geom_boxplot(shape = "circle") +
scale_fill_manual(values = c("#0073C2FF", "#EFC000FF"))+
labs(title = "Status VS Age") +
theme_economist()
ggplotly(plot2)
Based our boxplot we can conclude our default borrower have median at 41 years old, it can be high risk if p2p lending give funding for borrower above 41 years old. Safe zone for p2p lending company is 38 years old or below.
Next we see default status and gender.
plot3 <- ggplot(p2p_novar) +
aes(x = Gender, fill = Status) +
geom_bar() +
scale_fill_manual(values = c("#0073C2FF", "#EFC000FF")) +
labs(title = "Status Vs Gender") +
theme_economist()
ggplotly(plot3)
Our data shows us male borrower tend to be default more than female borrower.
Last we see our verification status.
plot4 <- ggplot(p2p_novar) +
aes(x = VerificationType, fill = Status) +
geom_bar() +
scale_fill_manual(values = c("#0073C2FF", "#EFC000FF"))+
labs(title = "Verification Type Status") +
theme_economist()
ggplotly(plot4)
We can see income and expenses verified is the highest default rate.
#set.seed(120)
#ctrl <- trainControl(method = "repeatedcv",number = 3,repeats = 2)
#model_forest <- caret::train(Status ~.,data = p2p_train2, method = "rf",trControl = ctrl)
#saveRDS(model_forest, "model_forest_fix.RDS")
model_rf <- readRDS("model_forest_fix.RDS")
model_rf
## Random Forest
##
## 32720 samples
## 19 predictor
## 2 classes: 'Default', 'Repaid'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 2 times)
## Summary of sample sizes: 21813, 21813, 21814, 21813, 21814, 21813, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7857122 0.4844938
## 23 0.9173594 0.8195725
## 44 0.9138753 0.8122693
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 23.
model_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 23
##
## OOB estimate of error rate: 7.9%
## Confusion matrix:
## Default Repaid class.error
## Default 19812 1004 0.04823213
## Repaid 1581 10323 0.13281250
Prediction Random Forest Model
pred_rf <- predict(model_rf, p2p_test2, type = "raw")
Random Forest Confussion Matrix
confusionMatrix(pred_rf,reference = p2p_test2$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Default Repaid
## Default 4919 416
## Repaid 248 2597
##
## Accuracy : 0.9188
## 95% CI : (0.9127, 0.9247)
## No Information Rate : 0.6317
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.8235
##
## Mcnemar's Test P-Value : 0.0000000000912
##
## Sensitivity : 0.9520
## Specificity : 0.8619
## Pos Pred Value : 0.9220
## Neg Pred Value : 0.9128
## Prevalence : 0.6317
## Detection Rate : 0.6013
## Detection Prevalence : 0.6522
## Balanced Accuracy : 0.9070
##
## 'Positive' Class : Default
##
Variable Important
varImp(model_rf)
## rf variable importance
##
## only 20 most important variables shown (out of 44)
##
## Overall
## PlannedInterestTillDate 100.0000
## AppliedAmount 21.1524
## MonthlyPayment 17.4886
## Amount 15.5052
## InterestAndPenaltyPaymentsMade 12.3246
## Interest 10.7884
## LoanDuration 10.0518
## RestructuredYes 8.6764
## IncomeTotal 6.6301
## Age 6.2733
## LiabilitiesTotal 5.9911
## MonthlyPaymentDay 4.5648
## RatingHR 3.9405
## ExistingLiabilities 2.3519
## VerificationTypeIncome verified 1.6063
## VerificationTypeIncome unverified 0.8964
## EducationHigher education 0.8229
## HomeOwnershipTypeOwner 0.8028
## GenderMale 0.7694
## EducationSecondary education 0.7514
plot(varImp(model_rf))
set.seed(120)
model_dt <-ctree(Status ~.,p2p_train2)
Prediction Decision Tree Model
pred_dt <- predict(model_dt, p2p_test2, type = "response")
Comparing Model
confusionMatrix(pred_dt, p2p_test2$Status, positive = "Default") #Decision Tree Model
## Confusion Matrix and Statistics
##
## Reference
## Prediction Default Repaid
## Default 4795 469
## Repaid 372 2544
##
## Accuracy : 0.8972
## 95% CI : (0.8904, 0.9037)
## No Information Rate : 0.6317
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.7776
##
## Mcnemar's Test P-Value : 0.0009318
##
## Sensitivity : 0.9280
## Specificity : 0.8443
## Pos Pred Value : 0.9109
## Neg Pred Value : 0.8724
## Prevalence : 0.6317
## Detection Rate : 0.5862
## Detection Prevalence : 0.6435
## Balanced Accuracy : 0.8862
##
## 'Positive' Class : Default
##
confusionMatrix(pred_rf,reference = p2p_test2$Status ) #Random Forest Model
## Confusion Matrix and Statistics
##
## Reference
## Prediction Default Repaid
## Default 4919 416
## Repaid 248 2597
##
## Accuracy : 0.9188
## 95% CI : (0.9127, 0.9247)
## No Information Rate : 0.6317
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.8235
##
## Mcnemar's Test P-Value : 0.0000000000912
##
## Sensitivity : 0.9520
## Specificity : 0.8619
## Pos Pred Value : 0.9220
## Neg Pred Value : 0.9128
## Prevalence : 0.6317
## Detection Rate : 0.6013
## Detection Prevalence : 0.6522
## Balanced Accuracy : 0.9070
##
## 'Positive' Class : Default
##
Based on our result Random Forest model is best with approximately Sensitivity 95%. So we will embed our random forest model to our dashboard.