1 Introduction

Indonesia has the world’s 16th largest economy and is the world’s fourth most populous country with 274 million people, out of these 84% below the age of 54. With Internet penetration standing at 67% and financial inclusion remaining a key challenge, Indonesia offers a fertile ground for fintech innovation. Indonesia’s fast-growing fintech ecosystem has been largely driven by a proactive government that has introduced rules in areas including peer-to-peer (P2P) lending, digital payments, and most recently open banking, in hopes to encourage innovation and improve financial inclusion. This favorable landscape has attracted the attention of both local and international investors, many of which are betting big on the prospects of digital financial services in the country.

2 Background

According to the report, the Indonesian fintech startups landscape is dominated by companies in the online lending space, which account for half of all fintech startups in the country. Online lending is followed by payments with 73 companies (23%), blockchain and cryptocurrencies with 26 companies (8%), and investment/personal finance with 24 companies (7%). Other segments represented include insurtech (15 companies), crowdfunding (9 companies), point-of-sale (POS) services (7 companies), and financial comparison (7 companies).

3 Problem Statement and Project Idea

Our fintech notably from lending sector is huge, President Jokowi said for unbankable borrower have more than Rp1000 trillion prospect but our fintech only can absorp not more than 20% approximately Rp 194,1 trillion. It likes double edge sword, if we can not maintain our p2p lending performance it could be massive failure too. Consequently, We make this project to boost fintech in Indonesia especially from lending sector to reduce the risk factor from default borrower. Our project will do prediction for borrower profile and make prediction whether our borrower will be default or not.

4 Problem Scope

In this project we will use dataset bondora p2p lending. This dataset contain information such as :

  • Age :The age of the borrower when signing the loan application
  • Amount :Amount the borrower received on the Primary Market. This is the principal balance of your purchase from Secondary Market
  • AmountOfPreviousLoansBeforeLoan :Value of previous loans
  • AppliedAmount :The amount borrower applied for originally
  • AuctionBidNumber :Unique bid number which is accompanied by Auction number
  • AuctionId :A unique number given to all auctions
  • CurrentDebtDaysPrimary :How long the loan has been in Principal Debt
  • CurrentDebtDaysSecondary :How long the loan has been in Interest Debt
  • DateOfBirth :The date of the borrower’s birth
  • DebtOccuredOn :The date when Principal Debt occurred
  • DebtOccuredOnForSecondary :The date when Interest Debt occurred
  • DebtToIncome :Ratio of borrower’s monthly gross income that goes toward paying loans
  • DefaultDate :The date when loan went into defaulted state and collection process was started

Source dataset : https://www.kaggle.com/sid321axn/bondora-peer-to-peer-lending-loan-data

5 Output

Outpout from this project is dashboard that will show which are the indicator to accepting borrower and default rate from each borrower.

Business Impact : Our business impact has wide scope, not only for p2p lending company but it also can from lender side even borrower can get impact too. Let we see one by one, first from company view our dashboard can be useful for their business as tool for make good TKB90, TKB90 is a measurement of the operators to facilitate the settlement of lending obligation within 90 days since the due date. Second, we see from lender prespective, becoming a lender is not easy, you have to choose which one is worthy borrower that has low percentage of defaulting, with our dashboard lender can easily choose low risk borrower. Third, from borrower view this dashboard can be useful for them, so they can improve their credit score by knowing what are important indicator in calculation.

6 Exploratory Data Analysis (EDA)

Import Library

library(dplyr)
library(tidyverse)
library(ggplot2)
library(caret)
library(e1071)
library(randomForest)
library(lubridate)
library(partykit)
library(plotly)
library(ggthemes)
options(scipen = 999)

Read Data

p2p <- read.csv("LoanData_Bondora.csv",sep=",",header=T,na.strings=c("",NA))
glimpse(p2p)
## Rows: 179,235
## Columns: 112
## $ ReportAsOfEOD                          <chr> "2021-07-20", "2021-07-20", "20…
## $ LoanId                                 <chr> "66AE108B-532B-4BB3-BAB7-0019A4…
## $ LoanNumber                             <int> 483449, 378148, 451831, 349381,…
## $ ListedOnUTC                            <chr> "2016-03-23 16:07:19", "2015-06…
## $ BiddingStartedOn                       <chr> "2016-03-23 16:07:19", "2015-06…
## $ BidsPortfolioManager                   <int> 970, 1295, 2700, 1115, 305, 600…
## $ BidsApi                                <int> 1150, 0, 565, 0, 0, 0, 0, 0, 0,…
## $ BidsManual                             <dbl> 5, 1705, 5835, 385, 785, 175, 0…
## $ UserName                               <chr> "BO965519", "BOA9K172A", "BO797…
## $ NewCreditCustomer                      <chr> "False", "False", "True", "True…
## $ LoanApplicationStartedDate             <chr> "2016-03-17 12:39:22", "2015-06…
## $ LoanDate                               <chr> "2016-03-23", "2015-06-25", "20…
## $ ContractEndDate                        <chr> "2020-06-26", NA, "2019-10-24",…
## $ FirstPaymentDate                       <chr> "2016-05-12", "2015-08-17", "20…
## $ MaturityDate_Original                  <chr> "2021-04-12", "2020-07-17", "20…
## $ MaturityDate_Last                      <chr> "2020-06-26", "2020-07-17", "20…
## $ ApplicationSignedHour                  <int> 17, 11, 22, 15, 20, 0, 15, 16, …
## $ ApplicationSignedWeekday               <int> 4, 5, 3, 3, 3, 5, 2, 4, 2, 4, 3…
## $ VerificationType                       <int> 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 4…
## $ LanguageCode                           <int> 1, 1, 1, 6, 6, 6, 1, 6, 4, 1, 1…
## $ Age                                    <int> 53, 50, 44, 42, 34, 31, 22, 47,…
## $ DateOfBirth                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Gender                                 <int> 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0…
## $ Country                                <chr> "EE", "EE", "EE", "ES", "ES", "…
## $ AppliedAmount                          <dbl> 2125, 3000, 10630, 1500, 1595, …
## $ Amount                                 <dbl> 2125, 3000, 9100, 1500, 1090, 7…
## $ Interest                               <dbl> 20.97, 17.12, 13.67, 40.40, 68.…
## $ LoanDuration                           <int> 60, 60, 60, 60, 48, 60, 60, 12,…
## $ MonthlyPayment                         <dbl> 62.05, 84.75, 268.57, 63.53, 10…
## $ County                                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ City                                   <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ UseOfLoan                              <int> 2, 3, 3, 2, 7, 7, 8, 6, 2, 0, 7…
## $ Education                              <int> 4, 5, 4, 2, 4, 4, 2, 3, 4, 4, 4…
## $ MaritalStatus                          <int> 2, 2, 4, 1, 4, 1, 3, 2, 1, 4, 3…
## $ NrOfDependants                         <chr> "0", "0", "1", "0", "2", "0", "…
## $ EmploymentStatus                       <int> 6, 5, 5, 3, 6, 3, 3, 6, 3, 3, 3…
## $ EmploymentDurationCurrentEmployer      <chr> "MoreThan5Years", "MoreThan5Yea…
## $ EmploymentPosition                     <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ WorkExperience                         <chr> "15To25Years", "MoreThan25Years…
## $ OccupationArea                         <int> 1, 7, 8, 1, 1, 7, 9, 1, 11, 8, …
## $ HomeOwnershipType                      <int> 1, 1, 8, 2, 3, 4, 4, 1, 1, 6, 3…
## $ IncomeFromPrincipalEmployer            <dbl> 0, 900, 600, 863, 0, 970, 745, …
## $ IncomeFromPension                      <dbl> 301, 0, 0, 0, 697, 0, 0, 1126, …
## $ IncomeFromFamilyAllowance              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ IncomeFromSocialWelfare                <dbl> 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ IncomeFromLeavePay                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ IncomeFromChildSupport                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ IncomeOther                            <dbl> 0, 0, 600, 0, 0, 0, 0, 0, 0, 0,…
## $ IncomeTotal                            <dbl> 354, 900, 1200, 863, 697, 970, …
## $ ExistingLiabilities                    <int> 8, 4, 7, 1, 5, 5, 1, 2, 8, 9, 7…
## $ LiabilitiesTotal                       <dbl> 485.09, 736.45, 905.00, 350.00,…
## $ RefinanceLiabilities                   <int> 6, 0, 3, 0, 2, 2, 0, 0, 0, 3, 1…
## $ DebtToIncome                           <dbl> 26.29, 30.58, 26.71, 7.36, 36.0…
## $ FreeCash                               <dbl> 10.92, 78.80, 349.43, 449.47, 9…
## $ MonthlyPaymentDay                      <int> 12, 17, 20, 1, 1, 1, 15, 25, 20…
## $ ActiveScheduleFirstPaymentReached      <chr> "True", "True", "True", "True",…
## $ PlannedPrincipalTillDate               <dbl> 630.22, 1333.51, 3348.50, 573.3…
## $ PlannedInterestTillDate                <dbl> 1251.98, 3000.00, 9100.00, 1500…
## $ LastPaymentOn                          <chr> "2021-06-16", "2019-06-19", "20…
## $ CurrentDebtDaysPrimary                 <int> 552, 1918, 1368, 2057, 1995, 20…
## $ DebtOccuredOn                          <chr> "2020-01-14", "2016-04-18", "20…
## $ CurrentDebtDaysSecondary               <int> 630, 1979, 1428, 2057, 1995, 20…
## $ DebtOccuredOnForSecondary              <chr> "2019-10-28", "2016-02-17", "20…
## $ ExpectedLoss                           <dbl> 0.06851187, 0.03079913, 0.02317…
## $ LossGivenDefault                       <dbl> 0.58, 0.65, 0.58, 0.90, 0.75, 0…
## $ ExpectedReturn                         <dbl> 0.1411449, 0.1404356, 0.1134841…
## $ ProbabilityOfDefault                   <dbl> 0.12221594, 0.03644867, 0.04134…
## $ DefaultDate                            <chr> "2020-01-14", "2016-06-02", "20…
## $ PrincipalOverdueBySchedule             <dbl> 1155.84, 2436.41, 0.00, 1035.27…
## $ PlannedPrincipalPostDefault            <dbl> 1251.98, 2658.82, 6456.37, 1406…
## $ PlannedInterestPostDefault             <dbl> 77.68, 1078.96, 1537.37, 1520.1…
## $ EAD1                                   <dbl> 1251.98, 2730.84, 6723.01, 1434…
## $ EAD2                                   <dbl> 64.07, 2370.77, 5014.54, 1078.7…
## $ PrincipalRecovery                      <dbl> 96.14, 294.43, 4160.01, 399.41,…
## $ InterestRecovery                       <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0, NA,…
## $ RecoveryStage                          <int> 2, 2, 2, 1, 1, 2, 2, NA, NA, NA…
## $ StageActiveSince                       <chr> "2020-03-03 09:27:48.493000000"…
## $ ModelVersion                           <int> 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1…
## $ Rating                                 <chr> "C", "B", "A", "F", "HR", "HR",…
## $ EL_V0                                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Rating_V0                              <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ EL_V1                                  <dbl> NA, 0.03079913, NA, 0.22080971,…
## $ Rating_V1                              <chr> NA, "B", NA, "F", NA, "HR", NA,…
## $ Rating_V2                              <chr> "C", "B", "A", "HR", "HR", "HR"…
## $ Status                                 <chr> "Late", "Late", "Repaid", "Late…
## $ Restructured                           <chr> "False", "False", "False", "Fal…
## $ ActiveLateCategory                     <chr> "180+", "180+", "180+", "180+",…
## $ WorseLateCategory                      <chr> "180+", "180+", "180+", "180+",…
## $ CreditScoreEsMicroL                    <chr> NA, NA, NA, "M3", "M5", "M5", N…
## $ CreditScoreEsEquifaxRisk               <chr> NA, NA, NA, "B", "C", "C", NA, …
## $ CreditScoreFiAsiakasTietoRiskGrade     <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CreditScoreEeMini                      <int> 1000, 1000, 1000, NA, NA, NA, 8…
## $ PrincipalPaymentsMade                  <dbl> 969.16, 563.59, 6537.00, 464.73…
## $ InterestAndPenaltyPaymentsMade         <dbl> 1187.91, 360.07, 1708.47, 355.9…
## $ PrincipalWriteOffs                     <dbl> 0.00, 0.00, 2303.33, 0.00, 0.00…
## $ InterestAndPenaltyWriteOffs            <dbl> 0.00, 0.00, 0.88, 0.00, 0.00, 0…
## $ PrincipalBalance                       <dbl> 1155.84, 2436.41, 0.00, 1035.27…
## $ InterestAndPenaltyBalance              <dbl> 433.60, 2291.82, 0.00, 2833.83,…
## $ NoOfPreviousLoansBeforeLoan            <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2…
## $ AmountOfPreviousLoansBeforeLoan        <dbl> 500, 1800, 0, 0, 0, 0, 0, 0, 0,…
## $ PreviousRepaymentsBeforeLoan           <dbl> 590.95, 445.26, 0.00, 0.00, 0.0…
## $ PreviousEarlyRepaymentsBefoleLoan      <dbl> 0, 3000, 0, 1500, 0, 0, 0, 1500…
## $ PreviousEarlyRepaymentsCountBeforeLoan <int> 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1…
## $ GracePeriodStart                       <chr> "2019-10-28", NA, NA, NA, NA, N…
## $ GracePeriodEnd                         <chr> "2020-01-27", NA, NA, NA, NA, N…
## $ NextPaymentDate                        <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ NextPaymentNr                          <int> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ NrOfScheduledPayments                  <int> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ReScheduledOn                          <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PrincipalDebtServicingCost             <dbl> 0.00, 0.00, 259.67, 0.00, 0.00,…
## $ InterestAndPenaltyDebtServicingCost    <dbl> 51.73, 47.08, 1659.60, 215.05, …
## $ ActiveLateLastPaymentCategory          <chr> "31-60", "180+", "180+", "180+"…

Check dataset dimension

dim(p2p)
## [1] 179235    112

Check missing value mean

colMeans(is.na(p2p))*100
##                          ReportAsOfEOD                                 LoanId 
##                            0.000000000                            0.000000000 
##                             LoanNumber                            ListedOnUTC 
##                            0.000000000                            0.000000000 
##                       BiddingStartedOn                   BidsPortfolioManager 
##                            0.000000000                            0.000000000 
##                                BidsApi                             BidsManual 
##                            0.000000000                            0.000000000 
##                               UserName                      NewCreditCustomer 
##                            0.000000000                            0.000000000 
##             LoanApplicationStartedDate                               LoanDate 
##                            0.000000000                            0.000000000 
##                        ContractEndDate                       FirstPaymentDate 
##                           42.535777052                            0.000000000 
##                  MaturityDate_Original                      MaturityDate_Last 
##                            0.000000000                            0.000000000 
##                  ApplicationSignedHour               ApplicationSignedWeekday 
##                            0.000000000                            0.000000000 
##                       VerificationType                           LanguageCode 
##                            0.027896337                            0.000000000 
##                                    Age                            DateOfBirth 
##                            0.000000000                          100.000000000 
##                                 Gender                                Country 
##                            0.025106703                            0.000000000 
##                          AppliedAmount                                 Amount 
##                            0.000000000                            0.000000000 
##                               Interest                           LoanDuration 
##                            0.000000000                            0.000000000 
##                         MonthlyPayment                                 County 
##                            3.732529919                          100.000000000 
##                                   City                              UseOfLoan 
##                          100.000000000                            0.000000000 
##                              Education                          MaritalStatus 
##                            0.027896337                            0.027896337 
##                         NrOfDependants                       EmploymentStatus 
##                           80.137807906                            0.112701202 
##      EmploymentDurationCurrentEmployer                     EmploymentPosition 
##                            0.896030351                          100.000000000 
##                         WorkExperience                         OccupationArea 
##                           79.623399448                            0.050771334 
##                      HomeOwnershipType            IncomeFromPrincipalEmployer 
##                            0.924484615                            0.000000000 
##                      IncomeFromPension              IncomeFromFamilyAllowance 
##                            0.000000000                            0.000000000 
##                IncomeFromSocialWelfare                     IncomeFromLeavePay 
##                            0.000000000                            0.000000000 
##                 IncomeFromChildSupport                            IncomeOther 
##                            0.000000000                            0.000000000 
##                            IncomeTotal                    ExistingLiabilities 
##                            0.000000000                            0.000000000 
##                       LiabilitiesTotal                   RefinanceLiabilities 
##                            0.000000000                            0.000000000 
##                           DebtToIncome                               FreeCash 
##                            0.027896337                            0.027896337 
##                      MonthlyPaymentDay      ActiveScheduleFirstPaymentReached 
##                            0.000000000                            0.000000000 
##               PlannedPrincipalTillDate                PlannedInterestTillDate 
##                           69.613077803                            1.101347393 
##                          LastPaymentOn                 CurrentDebtDaysPrimary 
##                            3.752057355                           57.472591849 
##                          DebtOccuredOn               CurrentDebtDaysSecondary 
##                           57.472591849                           54.100482607 
##              DebtOccuredOnForSecondary                           ExpectedLoss 
##                           54.100482607                            1.472368678 
##                       LossGivenDefault                         ExpectedReturn 
##                            1.472368678                            1.472368678 
##                   ProbabilityOfDefault                            DefaultDate 
##                            1.472368678                           60.155103635 
##             PrincipalOverdueBySchedule            PlannedPrincipalPostDefault 
##                            8.281306664                           60.155661562 
##             PlannedInterestPostDefault                                   EAD1 
##                           60.155661562                           60.156219488 
##                                   EAD2                      PrincipalRecovery 
##                           60.156219488                           60.155661562 
##                       InterestRecovery                          RecoveryStage 
##                           60.155661562                           37.545680252 
##                       StageActiveSince                           ModelVersion 
##                           33.747872904                            1.472368678 
##                                 Rating                                  EL_V0 
##                            1.524813792                           97.450832706 
##                              Rating_V0                                  EL_V1 
##                           97.450832706                           92.789354758 
##                              Rating_V1                              Rating_V2 
##                           92.789354758                           85.972047870 
##                                 Status                           Restructured 
##                            0.000000000                            0.000000000 
##                     ActiveLateCategory                      WorseLateCategory 
##                           58.164421012                           29.195748598 
##                    CreditScoreEsMicroL               CreditScoreEsEquifaxRisk 
##                           16.907412057                           93.182693112 
##     CreditScoreFiAsiakasTietoRiskGrade                      CreditScoreEeMini 
##                           76.663040143                           41.363572963 
##                  PrincipalPaymentsMade         InterestAndPenaltyPaymentsMade 
##                            0.000000000                            0.000000000 
##                     PrincipalWriteOffs            InterestAndPenaltyWriteOffs 
##                           57.513878428                           57.513878428 
##                       PrincipalBalance              InterestAndPenaltyBalance 
##                            0.000000000                            2.919072726 
##            NoOfPreviousLoansBeforeLoan        AmountOfPreviousLoansBeforeLoan 
##                            0.005021341                            0.005021341 
##           PreviousRepaymentsBeforeLoan      PreviousEarlyRepaymentsBefoleLoan 
##                           34.806817865                           67.146483667 
## PreviousEarlyRepaymentsCountBeforeLoan                       GracePeriodStart 
##                            0.005021341                           76.631238318 
##                         GracePeriodEnd                        NextPaymentDate 
##                           76.631238318                           63.388289118 
##                          NextPaymentNr                  NrOfScheduledPayments 
##                           37.349847965                           37.349847965 
##                          ReScheduledOn             PrincipalDebtServicingCost 
##                           64.826066337                           57.513878428 
##    InterestAndPenaltyDebtServicingCost          ActiveLateLastPaymentCategory 
##                           57.513878428                           54.100482607

Based on our dataset we have many missing value, we have to drop some column that contain more than 20% missing value.

Remove columns with more than 20% NA

p2p_clean <- p2p[, which(colMeans(!is.na(p2p)) > 0.8)]
colMeans(is.na(p2p_clean))*100
##                          ReportAsOfEOD                                 LoanId 
##                            0.000000000                            0.000000000 
##                             LoanNumber                            ListedOnUTC 
##                            0.000000000                            0.000000000 
##                       BiddingStartedOn                   BidsPortfolioManager 
##                            0.000000000                            0.000000000 
##                                BidsApi                             BidsManual 
##                            0.000000000                            0.000000000 
##                               UserName                      NewCreditCustomer 
##                            0.000000000                            0.000000000 
##             LoanApplicationStartedDate                               LoanDate 
##                            0.000000000                            0.000000000 
##                       FirstPaymentDate                  MaturityDate_Original 
##                            0.000000000                            0.000000000 
##                      MaturityDate_Last                  ApplicationSignedHour 
##                            0.000000000                            0.000000000 
##               ApplicationSignedWeekday                       VerificationType 
##                            0.000000000                            0.027896337 
##                           LanguageCode                                    Age 
##                            0.000000000                            0.000000000 
##                                 Gender                                Country 
##                            0.025106703                            0.000000000 
##                          AppliedAmount                                 Amount 
##                            0.000000000                            0.000000000 
##                               Interest                           LoanDuration 
##                            0.000000000                            0.000000000 
##                         MonthlyPayment                              UseOfLoan 
##                            3.732529919                            0.000000000 
##                              Education                          MaritalStatus 
##                            0.027896337                            0.027896337 
##                       EmploymentStatus      EmploymentDurationCurrentEmployer 
##                            0.112701202                            0.896030351 
##                         OccupationArea                      HomeOwnershipType 
##                            0.050771334                            0.924484615 
##            IncomeFromPrincipalEmployer                      IncomeFromPension 
##                            0.000000000                            0.000000000 
##              IncomeFromFamilyAllowance                IncomeFromSocialWelfare 
##                            0.000000000                            0.000000000 
##                     IncomeFromLeavePay                 IncomeFromChildSupport 
##                            0.000000000                            0.000000000 
##                            IncomeOther                            IncomeTotal 
##                            0.000000000                            0.000000000 
##                    ExistingLiabilities                       LiabilitiesTotal 
##                            0.000000000                            0.000000000 
##                   RefinanceLiabilities                           DebtToIncome 
##                            0.000000000                            0.027896337 
##                               FreeCash                      MonthlyPaymentDay 
##                            0.027896337                            0.000000000 
##      ActiveScheduleFirstPaymentReached                PlannedInterestTillDate 
##                            0.000000000                            1.101347393 
##                          LastPaymentOn                           ExpectedLoss 
##                            3.752057355                            1.472368678 
##                       LossGivenDefault                         ExpectedReturn 
##                            1.472368678                            1.472368678 
##                   ProbabilityOfDefault             PrincipalOverdueBySchedule 
##                            1.472368678                            8.281306664 
##                           ModelVersion                                 Rating 
##                            1.472368678                            1.524813792 
##                                 Status                           Restructured 
##                            0.000000000                            0.000000000 
##                    CreditScoreEsMicroL                  PrincipalPaymentsMade 
##                           16.907412057                            0.000000000 
##         InterestAndPenaltyPaymentsMade                       PrincipalBalance 
##                            0.000000000                            0.000000000 
##              InterestAndPenaltyBalance            NoOfPreviousLoansBeforeLoan 
##                            2.919072726                            0.005021341 
##        AmountOfPreviousLoansBeforeLoan PreviousEarlyRepaymentsCountBeforeLoan 
##                            0.005021341                            0.005021341

Change dataset type and drop unuseful column

p2p_clean <- p2p_clean %>%
              na.omit() %>% 
              mutate_if(is.character, as.factor) %>% 
              select(-c(UserName, LoanId, LoanNumber)) %>%
              mutate(Gender = case_when(Gender == 0 ~ "Male",
                                        Gender == 1 ~ "Female",
                                        Gender == 2 ~ "Undefined"),
                     MaritalStatus = case_when(
                       MaritalStatus == 1 ~ "Married",
                       MaritalStatus == 2 ~ "Cohabitant", 
                       MaritalStatus == 3 ~ "Single", 
                       MaritalStatus == 4 ~ "Divorced",
                       MaritalStatus == 5 ~ "Widow",
                       MaritalStatus == -1 ~ "Not specified",
                       MaritalStatus == 0 ~ "Not specified"),
                     EmploymentStatus = case_when(
                       EmploymentStatus == 1 ~ "Unemployed",
                       EmploymentStatus == 2 ~ "Partially employed",
                       EmploymentStatus == 3 ~ "Fully employed",
                       EmploymentStatus == 4 ~ "Self-employed",
                       EmploymentStatus == 5 ~ "Entreprenuer",
                       EmploymentStatus == 6 ~ "Retiree",
                       EmploymentStatus == -1 ~ "Not specified",
                       EmploymentStatus == 0 ~ "Not specified"),
                     Education = case_when(
                       Education == 1 ~ "Primary education",
                       Education == 2 ~ "Basic education",
                       Education == 3 ~ "Vocational education",
                       Education == 4 ~ "Secondary education",
                       Education == 5 ~ "Higher education",
                       Education == -1 ~ "Unknown",
                       Education == 0 ~ "Unkown"),
                     VerificationType = case_when(
                       VerificationType == 0 ~ "Not set",
                       VerificationType == 1 ~ "Income unverified",
                       VerificationType == 2 ~ "Income unverified, cross-referenced by phone",
                       VerificationType == 3 ~ "Income verified",
                       VerificationType == 4 ~ "Income and expenses verified"),
                     HomeOwnershipType = case_when(
                       HomeOwnershipType == 0 ~ "Homeless",
                       HomeOwnershipType == 1 ~ "Owner",
                       HomeOwnershipType == 2 ~ "Living with parents",
                       HomeOwnershipType == 3 ~ "Tenant, pre-furnished property",
                       HomeOwnershipType == 4 ~ "Tenant, unfurnished property",
                       HomeOwnershipType == 5 ~ "Council house",
                       HomeOwnershipType == 6 ~ "Joint tenant",
                       HomeOwnershipType == 7 ~ "Joint ownership",
                       HomeOwnershipType == 8 ~ "Mortgage",
                       HomeOwnershipType == 9 ~ "Owner with encumbrance",
                       HomeOwnershipType == 10 ~ "Other"),
                     Restructured = case_when(
                       Restructured == "True" ~ "Yes",
                       Restructured == "False" ~ "No"),
                     NewCreditCustomer = case_when(
                       NewCreditCustomer == "True" ~ "New Credit Customer",
                       NewCreditCustomer == "False" ~ "Existing Credit Customer"),
                     LanguageCode = case_when(
                       LanguageCode == 1 ~ "Estonian",
                       LanguageCode == 2 ~ "English",
                       LanguageCode == 3 ~ "Russian",
                       LanguageCode == 4 ~ "Finnish",
                       LanguageCode == 5 ~ "German",
                       LanguageCode == 6 ~ "Spanish",
                       LanguageCode == 9 ~ "Slovakian",
                       LanguageCode == 7 ~ "Others",
                       LanguageCode == 10 ~ "Others",
                       LanguageCode == 13 ~ "Others",
                       LanguageCode == 15 ~ "Others",
                       LanguageCode == 21 ~ "Others",
                       LanguageCode == 22 ~ "Others"),
                     UseOfLoan = case_when(
                       UseOfLoan == -1 ~ "Not Set",
                       UseOfLoan == 0 ~ "Loan Consolidation",
                       UseOfLoan == 1 ~ "Real Estate",
                       UseOfLoan == 2 ~ "Home Improvement",
                       UseOfLoan == 3 ~ "Business",
                       UseOfLoan == 4 ~ "Education",
                       UseOfLoan == 5 ~ "Travel",
                       UseOfLoan == 6 ~ "Vehicle",
                       UseOfLoan == 7 ~ "Other",
                       UseOfLoan == 8 ~ "Health",
                       UseOfLoan == 101 ~ "Working capital financing",
                       UseOfLoan == 102 ~ "Purchase of machinery equipment",
                       UseOfLoan == 104 ~ "Accounts receivable financing",
                       UseOfLoan == 106 ~ "Construction finance",
                       UseOfLoan == 107 ~ "Acquisition of stocks",
                       UseOfLoan == 108 ~ "Acquisition of real estate",
                       UseOfLoan == 110 ~ "Other Business"),
                     OccupationArea = case_when(
                       OccupationArea == 1 ~ "Other",
                       OccupationArea == 2 ~ "Mining",
                       OccupationArea == 3 ~ "Processing",
                       OccupationArea == 4 ~ "Energy",
                       OccupationArea == 5 ~ "Utilities",
                       OccupationArea == 6 ~ "Construction",
                       OccupationArea == 7 ~ "Retail and Wholesale",
                       OccupationArea == 8 ~ "Transport and Warehousing",
                       OccupationArea == 9 ~ "Hospitality and Catering",
                       OccupationArea == 10 ~ "Info and Telecom",
                       OccupationArea == 11 ~ "Finance and Insurance",
                       OccupationArea == 12 ~ "Real Estate",
                       OccupationArea == 13 ~ "Research",
                       OccupationArea == 14 ~ "Administrative",
                       OccupationArea == 15 ~ "Civil Service and Military",
                       OccupationArea == 16 ~ "Education",
                       OccupationArea == 17 ~ "Healthcare and Social Help",
                       OccupationArea == 18 ~ "Art and Entertainment",
                       OccupationArea == 19 ~ "Argiculture Forestry and Fishing",
                       OccupationArea == -1 ~ "Not specified",
                       OccupationArea == 0 ~ "Not specified"),
                     Status = case_when(
                       Status == "Repaid" ~ "Repaid",
                       Status == "Late" ~ "Default"))%>%   
              mutate(ReportAsOfEOD = ymd(ReportAsOfEOD),
                     LoanApplicationStartedDate = ymd_hms(LoanApplicationStartedDate),
                     BiddingStartedOn = ymd_hms(BiddingStartedOn),
                     ListedOnUTC = ymd_hms(ListedOnUTC),
                     LoanDate = ymd(LoanDate),
                     FirstPaymentDate = ymd(FirstPaymentDate),
                     MaturityDate_Original = ymd(MaturityDate_Original),
                     MaturityDate_Last = ymd(MaturityDate_Last),
                     LastPaymentOn = ymd(LastPaymentOn),
                     Gender = as.factor(Gender),
                     MaritalStatus = as.factor(MaritalStatus),
                     EmploymentStatus = as.factor(EmploymentStatus),
                     Education = as.factor(Education),
                     VerificationType = as.factor(VerificationType),
                     HomeOwnershipType = as.factor(HomeOwnershipType),
                     Restructured = as.factor(Restructured),
                     NewCreditCustomer = as.factor(NewCreditCustomer),
                     LanguageCode = as.factor(LanguageCode),
                     UseOfLoan = as.factor(UseOfLoan),
                     OccupationArea = as.factor(OccupationArea),
                     Status = as.factor(Status))%>% 
              filter(LoanDate >= "2015-01-01",
                     Status != "Current", 
                     Gender != "Undefined",
                     Education != "Unknown",
                     NewCreditCustomer != "Existing Credit Customer") %>% 
              droplevels("Current", "Undefined", "Unknown")

Create Near Zero Var Function

n0_var <- nearZeroVar(p2p_clean)
p2p_novar <- p2p_clean[, -n0_var]
dim(p2p_novar)
## [1] 40900    45
colSums(is.na(p2p_novar))
##                       ListedOnUTC                  BiddingStartedOn 
##                                 0                                 0 
##              BidsPortfolioManager                           BidsApi 
##                                 0                                 0 
##                        BidsManual        LoanApplicationStartedDate 
##                                 0                                 0 
##                          LoanDate                  FirstPaymentDate 
##                                 0                                 0 
##             MaturityDate_Original                 MaturityDate_Last 
##                                 0                                 0 
##             ApplicationSignedHour          ApplicationSignedWeekday 
##                                 0                                 0 
##                  VerificationType                      LanguageCode 
##                                 0                                 0 
##                               Age                            Gender 
##                                 0                                 0 
##                           Country                     AppliedAmount 
##                                 0                                 0 
##                            Amount                          Interest 
##                                 0                                 0 
##                      LoanDuration                    MonthlyPayment 
##                                 0                                 0 
##                         Education EmploymentDurationCurrentEmployer 
##                                 0                                 0 
##                 HomeOwnershipType                       IncomeTotal 
##                                 0                                 0 
##               ExistingLiabilities                  LiabilitiesTotal 
##                                 0                                 0 
##                 MonthlyPaymentDay           PlannedInterestTillDate 
##                                 0                                 0 
##                     LastPaymentOn                      ExpectedLoss 
##                                 0                                 0 
##                  LossGivenDefault                    ExpectedReturn 
##                                 0                                 0 
##              ProbabilityOfDefault        PrincipalOverdueBySchedule 
##                                 0                                 0 
##                      ModelVersion                            Rating 
##                                 0                                 0 
##                            Status                      Restructured 
##                                 0                                 0 
##             PrincipalPaymentsMade    InterestAndPenaltyPaymentsMade 
##                                 0                                 0 
##                  PrincipalBalance         InterestAndPenaltyBalance 
##                                 0                                 0 
##       NoOfPreviousLoansBeforeLoan 
##                                 0
p2p_novar

Drop unuseful columns

p2p_novar_clean2 <- p2p_novar %>% 
  select(-c(ListedOnUTC, BiddingStartedOn, BidsApi, LoanApplicationStartedDate, LoanDate, FirstPaymentDate, MaturityDate_Original, ApplicationSignedHour,ApplicationSignedWeekday, MaturityDate_Last, LastPaymentOn, ModelVersion, PrincipalBalance, InterestAndPenaltyBalance, PrincipalOverdueBySchedule, BidsPortfolioManager, BidsManual, LanguageCode, Country, ProbabilityOfDefault, NoOfPreviousLoansBeforeLoan, ExpectedReturn, ExpectedLoss, LossGivenDefault, PrincipalPaymentsMade))
p2p_novar_clean2

Create summary dataset

summary(p2p_novar_clean2)
##                      VerificationType      Age           Gender     
##  Income and expenses verified:28445   Min.   :18.00   Female:13328  
##  Income unverified           :11695   1st Qu.:30.00   Male  :27572  
##  Income verified             :  760   Median :39.00                 
##                                       Mean   :40.77                 
##                                       3rd Qu.:51.00                 
##                                       Max.   :70.00                 
##                                                                     
##  AppliedAmount       Amount         Interest       LoanDuration  
##  Min.   :  500   Min.   :  112   Min.   :  7.26   Min.   : 3.00  
##  1st Qu.: 1275   1st Qu.: 1100   1st Qu.: 29.66   1st Qu.:36.00  
##  Median : 2550   Median : 2230   Median : 36.95   Median :60.00  
##  Mean   : 3082   Mean   : 2917   Mean   : 42.09   Mean   :49.26  
##  3rd Qu.: 4250   3rd Qu.: 4250   3rd Qu.: 57.09   3rd Qu.:60.00  
##  Max.   :10632   Max.   :10632   Max.   :264.31   Max.   :60.00  
##                                                                  
##  MonthlyPayment                   Education    
##  Min.   :   0.00   Basic education     :  499  
##  1st Qu.:  64.81   Higher education    : 9595  
##  Median : 118.92   Primary education   : 5896  
##  Mean   : 142.57   Secondary education :12331  
##  3rd Qu.: 184.45   Vocational education:12579  
##  Max.   :2058.32                               
##                                                
##  EmploymentDurationCurrentEmployer                      HomeOwnershipType
##  MoreThan5Years:14591              Owner                         :13578  
##  UpTo5Years    :10073              Tenant, pre-furnished property:12373  
##  UpTo1Year     : 8169              Living with parents           : 5861  
##  Retiree       : 3949              Mortgage                      : 4324  
##  Other         : 2812              Other                         : 3701  
##  UpTo2Years    :  572              Tenant, unfurnished property  :  443  
##  (Other)       :  734              (Other)                       :  620  
##   IncomeTotal        ExistingLiabilities LiabilitiesTotal    MonthlyPaymentDay
##  Min.   :      0.2   Min.   : 0.000      Min.   :     0.00   Min.   : 0.00    
##  1st Qu.:   1000.0   1st Qu.: 0.000      1st Qu.:     0.00   1st Qu.: 4.00    
##  Median :   1430.0   Median : 1.000      Median :    74.72   Median :10.00    
##  Mean   :   1915.7   Mean   : 1.224      Mean   :   267.15   Mean   :10.45    
##  3rd Qu.:   2100.0   3rd Qu.: 2.000      3rd Qu.:   350.00   3rd Qu.:16.00    
##  Max.   :1012019.0   Max.   :32.000      Max.   :250100.00   Max.   :27.00    
##                                                                               
##  PlannedInterestTillDate     Rating          Status      Restructured
##  Min.   :    0.0         E      :11805   Default:25983   No :30677   
##  1st Qu.:  345.0         F      :10116   Repaid :14917   Yes:10223   
##  Median :  705.1         D      : 7816                               
##  Mean   : 1373.1         C      : 4559                               
##  3rd Qu.: 1957.7         HR     : 3765                               
##  Max.   :10632.0         B      : 1529                               
##                          (Other): 1310                               
##  InterestAndPenaltyPaymentsMade
##  Min.   :    0.00              
##  1st Qu.:   82.25              
##  Median :  296.11              
##  Mean   :  612.99              
##  3rd Qu.:  779.02              
##  Max.   :12902.48              
## 

6.1 Cross Validation

We have to split our data into 80% train data and 20% test data.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(120)

index <- sample(nrow(p2p_novar_clean2), nrow(p2p_novar_clean2)*0.8)

p2p_train2 <- p2p_novar_clean2[index,]
p2p_test2 <- p2p_novar_clean2[-index,]

Check head train dataset

head(p2p_train2)
prop.table(table(p2p_novar_clean2$Status))
## 
##   Default    Repaid 
## 0.6352812 0.3647188

6.2 Data Visulization

Let’s see status loan application.

plot5<- ggplot(p2p_novar) +
 aes(x = Status, fill = Status) +
 geom_bar() +
 scale_fill_manual(values = c("#0073C2FF", "#EFC000FF"))+
 labs(title = "Loan Status") +
 theme_economist()

ggplotly(plot5)

We can see from graphic above our repaid status is default. That means this p2p lending screening is not good enough to select worthy borrower.

We can make loan amount graphic to look at how good our funding distribution.

plot0 <- ggplot(p2p_novar) +
 aes(x = Amount, colour = Amount) +
 geom_histogram(bins = 20L, fill = "purple") +
 scale_color_distiller(palette = "PuBu", 
 direction = -1) +
 labs(title = "Funding Distribution") +
 theme_economist()

ggplotly(plot0)

From graphic above we can see most of our borrower get amount under $3000.

See our borrower age distribution

plot1 <- ggplot(p2p_novar) +
 aes(x = Age, colour = Age) +
 geom_histogram(bins = 30L, fill = "Purple") +
 scale_color_distiller(palette = "PuBu", 
 direction = -1) +
 labs(title = "Age Borrower Distribution") +
 theme_economist()

ggplotly(plot1)

Most of our borrower have age between 20 - 40 years old. Peer to peer lending company usually like this age because they have power to earn money from their job.

We make boxplot from our status vs age.

plot2 <- ggplot(p2p_novar) +
 aes(x = Status, y = Age, fill = Status) +
 geom_boxplot(shape = "circle") +
 scale_fill_manual(values = c("#0073C2FF", "#EFC000FF"))+
 labs(title = "Status VS Age") +
 theme_economist()

ggplotly(plot2)

Based our boxplot we can conclude our default borrower have median at 41 years old, it can be high risk if p2p lending give funding for borrower above 41 years old. Safe zone for p2p lending company is 38 years old or below.

Next we see default status and gender.

plot3 <- ggplot(p2p_novar) +
 aes(x = Gender, fill = Status) +
 geom_bar() +
 scale_fill_manual(values = c("#0073C2FF", "#EFC000FF")) +
  labs(title = "Status Vs Gender") +
 theme_economist() 

ggplotly(plot3)

Our data shows us male borrower tend to be default more than female borrower.

Last we see our verification status.

plot4 <- ggplot(p2p_novar) +
 aes(x = VerificationType, fill = Status) +
 geom_bar() +
 scale_fill_manual(values = c("#0073C2FF", "#EFC000FF"))+
 labs(title = "Verification Type Status") +
 theme_economist()

ggplotly(plot4)

We can see income and expenses verified is the highest default rate.

7 Modelling

7.0.1 Random Forest

#set.seed(120)
#ctrl <- trainControl(method = "repeatedcv",number = 3,repeats = 2)

#model_forest <- caret::train(Status ~.,data = p2p_train2, method = "rf",trControl = ctrl)

#saveRDS(model_forest, "model_forest_fix.RDS")
model_rf <- readRDS("model_forest_fix.RDS")
model_rf
## Random Forest 
## 
## 32720 samples
##    19 predictor
##     2 classes: 'Default', 'Repaid' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 2 times) 
## Summary of sample sizes: 21813, 21813, 21814, 21813, 21814, 21813, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7857122  0.4844938
##   23    0.9173594  0.8195725
##   44    0.9138753  0.8122693
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 23.
model_rf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 23
## 
##         OOB estimate of  error rate: 7.9%
## Confusion matrix:
##         Default Repaid class.error
## Default   19812   1004  0.04823213
## Repaid     1581  10323  0.13281250

Prediction Random Forest Model

pred_rf <- predict(model_rf, p2p_test2, type = "raw")

Random Forest Confussion Matrix

confusionMatrix(pred_rf,reference = p2p_test2$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Default Repaid
##    Default    4919    416
##    Repaid      248   2597
##                                                
##                Accuracy : 0.9188               
##                  95% CI : (0.9127, 0.9247)     
##     No Information Rate : 0.6317               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.8235               
##                                                
##  Mcnemar's Test P-Value : 0.0000000000912      
##                                                
##             Sensitivity : 0.9520               
##             Specificity : 0.8619               
##          Pos Pred Value : 0.9220               
##          Neg Pred Value : 0.9128               
##              Prevalence : 0.6317               
##          Detection Rate : 0.6013               
##    Detection Prevalence : 0.6522               
##       Balanced Accuracy : 0.9070               
##                                                
##        'Positive' Class : Default              
## 

Variable Important

varImp(model_rf)
## rf variable importance
## 
##   only 20 most important variables shown (out of 44)
## 
##                                    Overall
## PlannedInterestTillDate           100.0000
## AppliedAmount                      21.1524
## MonthlyPayment                     17.4886
## Amount                             15.5052
## InterestAndPenaltyPaymentsMade     12.3246
## Interest                           10.7884
## LoanDuration                       10.0518
## RestructuredYes                     8.6764
## IncomeTotal                         6.6301
## Age                                 6.2733
## LiabilitiesTotal                    5.9911
## MonthlyPaymentDay                   4.5648
## RatingHR                            3.9405
## ExistingLiabilities                 2.3519
## VerificationTypeIncome verified     1.6063
## VerificationTypeIncome unverified   0.8964
## EducationHigher education           0.8229
## HomeOwnershipTypeOwner              0.8028
## GenderMale                          0.7694
## EducationSecondary education        0.7514
plot(varImp(model_rf))

7.0.2 Decision Tree

set.seed(120)
model_dt <-ctree(Status ~.,p2p_train2)

Prediction Decision Tree Model

pred_dt <- predict(model_dt, p2p_test2, type = "response") 

Comparing Model

confusionMatrix(pred_dt, p2p_test2$Status, positive = "Default") #Decision Tree Model
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Default Repaid
##    Default    4795    469
##    Repaid      372   2544
##                                                
##                Accuracy : 0.8972               
##                  95% CI : (0.8904, 0.9037)     
##     No Information Rate : 0.6317               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.7776               
##                                                
##  Mcnemar's Test P-Value : 0.0009318            
##                                                
##             Sensitivity : 0.9280               
##             Specificity : 0.8443               
##          Pos Pred Value : 0.9109               
##          Neg Pred Value : 0.8724               
##              Prevalence : 0.6317               
##          Detection Rate : 0.5862               
##    Detection Prevalence : 0.6435               
##       Balanced Accuracy : 0.8862               
##                                                
##        'Positive' Class : Default              
## 
confusionMatrix(pred_rf,reference = p2p_test2$Status ) #Random Forest Model
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Default Repaid
##    Default    4919    416
##    Repaid      248   2597
##                                                
##                Accuracy : 0.9188               
##                  95% CI : (0.9127, 0.9247)     
##     No Information Rate : 0.6317               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.8235               
##                                                
##  Mcnemar's Test P-Value : 0.0000000000912      
##                                                
##             Sensitivity : 0.9520               
##             Specificity : 0.8619               
##          Pos Pred Value : 0.9220               
##          Neg Pred Value : 0.9128               
##              Prevalence : 0.6317               
##          Detection Rate : 0.6013               
##    Detection Prevalence : 0.6522               
##       Balanced Accuracy : 0.9070               
##                                                
##        'Positive' Class : Default              
## 

Based on our result Random Forest model is best with approximately Sensitivity 95%. So we will embed our random forest model to our dashboard.