PREDICTING LOAN DEFAULTS WITH PREDICTIVE MODELS

DATA SUMMARY

The dataset used in this analysis is from a website www.lendingclub.com. It contains data on real loans by people. It has 30 input variables and 26,562 observations. The variables describe the credit characteristics and loan application information from loan applicants. It also includes a response variable called ‘target’ which has two levels, 1 for bad and 0 for a good loan. Bad is defined as being 30 or more days past due 13 months into the loan. This is used as a ‘surrogate’ for the loan defaulting

library(ggplot2)
library(tidyverse)

setwd("D:/R-STUDIO/STAT 515/Final_Project") #Reading the data into R
df_loan<-read.csv(file="LendingClub2021.csv",na.strings = c("","NA"))
dplyr::glimpse(df_loan)

## Rows: 26,562
## Columns: 31
## $ LOAN_ID                  <int> 290807, 347648, 348340, 357509, 362601, 36359~
## $ MOB                      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ InterestRate             <dbl> 15.70, 12.73, 11.12, 10.25, 13.23, 10.38, 9.8~
## $ IssuedDate               <chr> "2/22/2010", "2/19/2010", "7/28/2010", "3/30/~
## $ DTI                      <dbl> 20.44, 20.64, 19.72, 11.62, 22.79, 11.57, 11.~
## $ State                    <chr> "NY", "CA", "NY", "CA", "KY", "GA", "NY", "MO~
## $ HomeOwnership            <chr> "RENT", "RENT", "RENT", "RENT", "RENT", "MORT~
## $ MonthlyIncome            <dbl> 7083.00, 5000.00, 5916.67, 5000.00, 3750.00, ~
## $ EarliestCREDITLine       <chr> "1/30/1995", "3/4/2005", "1/13/2001", "11/12/~
## $ OpenCREDITLines          <int> 16, 8, 11, 13, 11, 16, 6, 13, 3, 11, 4, 4, 18~
## $ TotalCREDITLines         <int> 32, 18, 18, 22, 26, 24, 24, 39, 7, 22, 7, 17,~
## $ RevolvingCREDITBalance   <int> 28511, 3353, 20736, 5419, 5738, 10516, 7875, ~
## $ RevolvingLineUtilization <dbl> 56.7, 54.1, 53.1, 31.1, 43.8, 19.5, 25.8, 65.~
## $ Inquiries6M              <int> 1, 0, 0, 2, 1, 2, 3, 0, 0, 1, 0, 3, 3, 0, 0, ~
## $ AccountsDQ               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ DelinquentAmount         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ DQ2yrs                   <int> 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, ~
## $ MonthsSinceDQ            <int> 55, 999, 999, 999, 2, 999, 61, 22, 999, 44, 1~
## $ PublicRec                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~
## $ MonthsSinceLastRec       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ Education                <chr> "Marist College,Xavier High School", NA, NA, ~
## $ EmploymentLength         <chr> "< 1 year", "1 year", "8 years", "1 year", "2~
## $ currentpolicy            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ term                     <int> 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 3~
## $ appl_fico_band           <chr> "670-674", "710-714", "710-714", "715-719", "~
## $ vintage                  <chr> "10Q1", "10Q1", "10Q3", "10Q1", "10Q3", "10Q3~
## $ TERM                     <int> 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 3~
## $ MFMonth                  <chr> "0010-03-01", "0010-03-01", "0010-09-01", "00~
## $ target                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Amount.Requested         <int> 15000, 10000, 8000, 7500, 4000, 3000, 10000, ~
## $ LoanPurpose              <chr> "debt_consolidation", "house", "credit_card",~

DATA PRE-PROCESSING

The input variable “LoanPurpose” had missing observations. These observations were replaced with the string “missing”. Also, the variable “MonthsSinceDQ” has missing observations recorded as 999. This value serves as flag for missing observations, hence, were also replaced with the string “missing”. After computing the percentage of missing observations in all the variables, it was observed that the variable “MonthsSinceLastRec” and “Education” have 97.2% (25820 observations) and 98.8% (26254 observations) percent respectively, missing observations. These two variables were removed from the dataset. Other variables considered as info variables such as vintage, MFMonth, Amount Requested, TERM, term and interest rate which were removed from the dataset before the model building process. These variables appear not having significant contribution in the prediction. Some of the variables with string values were re-code. Among these include state, and EmploymentLength. Also, appl_fico_band which had ranged observations was re-code. The difference between IssuedDate and EarliestCREDITLine was computed as NoYear representing the number of years a person has been loaned since his/her first known credit line. IssuedDate and EarliestCREDITLine were also removed from the data after their difference was computed. HomeOwwnership has 4 levels with one level (OTHER) having only two values. This presented a little issue during the model building process, hence, the level was RENT which is the third level of the variable. The final dataset after the processing stage had 26,562 rows with 15 columns.

SmartEDA::ExpData(df_loan,2)#Computing percentage of missing values in each variable in the data

##    Index            Variable_Name Variable_Type Sample_n Missing_Count
## 1      1                  LOAN_ID       integer    26562             0
## 2      2                      MOB       integer    26562             0
## 3      3             InterestRate       numeric    26562             0
## 4      4               IssuedDate     character    26562             0
## 5      5                      DTI       numeric    26562             0
## 6      6                    State     character    26562             0
## 7      7            HomeOwnership     character    26562             0
## 8      8            MonthlyIncome       numeric    26562             0
## 9      9       EarliestCREDITLine     character    26562             0
## 10    10          OpenCREDITLines       integer    26562             0
## 11    11         TotalCREDITLines       integer    26562             0
## 12    12   RevolvingCREDITBalance       integer    26562             0
## 13    13 RevolvingLineUtilization       numeric    26562             0
## 14    14              Inquiries6M       integer    26562             0
## 15    15               AccountsDQ       integer    26562             0
## 16    16         DelinquentAmount       integer    26562             0
## 17    17                   DQ2yrs       integer    26562             0
## 18    18            MonthsSinceDQ       integer    26562             0
## 19    19                PublicRec       integer    26562             0
## 20    20       MonthsSinceLastRec       integer      742         25820
## 21    21                Education     character      308         26254
## 22    22         EmploymentLength     character    26562             0
## 23    23            currentpolicy       integer    26562             0
## 24    24                     term       integer    26562             0
## 25    25           appl_fico_band     character    26562             0
## 26    26                  vintage     character    26562             0
## 27    27                     TERM       integer    26562             0
## 28    28                  MFMonth     character    26562             0
## 29    29                   target       integer    26562             0
## 30    30         Amount.Requested       integer    26562             0
## 31    31              LoanPurpose     character    24120          2442
##    Per_of_Missing No_of_distinct_values
## 1           0.000                 26562
## 2           0.000                     1
## 3           0.000                   201
## 4           0.000                   527
## 5           0.000                  2750
## 6           0.000                    44
## 7           0.000                     4
## 8           0.000                  3835
## 9           0.000                  8546
## 10          0.000                    37
## 11          0.000                    78
## 12          0.000                 17440
## 13          0.000                  1067
## 14          0.000                     9
## 15          0.000                     1
## 16          0.000                     1
## 17          0.000                    11
## 18          0.000                    94
## 19          0.000                     5
## 20          0.972                    85
## 21          0.988                   278
## 22          0.000                    12
## 23          0.000                     1
## 24          0.000                     2
## 25          0.000                    34
## 26          0.000                     9
## 27          0.000                     2
## 28          0.000                    27
## 29          0.000                     2
## 30          0.000                   838
## 31          0.092                    13

df_loan$MonthsSinceDQ<-ifelse(df_loan$MonthsSinceDQ==999,"missing",df_loan$MonthsSinceDQ)#Replacing 999 in MonthsSinceDQ with missing
df_loan$LoanPurpose<-ifelse(is.na(df_loan$LoanPurpose),'missing',df_loan$LoanPurpose) #Replacing NA's in LoanPurpose with missing
df_loan$NoYear<-as.numeric(format(as.Date(df_loan$IssuedDate,
                                          format = "%m/%d/%Y"),"%Y"))-as.numeric(format(as.Date(df_loan$EarliestCREDITLine,
                                                                                                format = "%m/%d/%Y"),"%Y")) #Computing the difference between IssuedDate and EarliestCREDITLine 
                                                                                                                            #and creating a new column "NoYear" to contain the difference

df_loan<-df_loan[,-c(4,9,20:21,24,26,28)] #Dropping term (same as TERM),Education, MonthsSinceLastRec and other info variables such as MFMonth, vintage, and IssuedDate, EarliestCREDITLine from the data
#df_loan[,c(5:6,16,18:20,26)]<-lapply(df_loan[,c(5:6,16,18:20,26)], as.factor)

EXPLORATORY ANALYSIS

numeric_vars<-df_loan%>% #Selecting only the numeric variables in the data
  select(MonthlyIncome,Amount.Requested,MOB,InterestRate,DTI,OpenCREDITLines,TotalCREDITLines,RevolvingCREDITBalance,
         RevolvingLineUtilization,Inquiries6M,AccountsDQ,DelinquentAmount,DQ2yrs,PublicRec,currentpolicy,TERM,NoYear)
summary(numeric_vars)#Summary statistics on the numeric variables

##  MonthlyIncome    Amount.Requested      MOB     InterestRate        DTI       
##  Min.   :   340   Min.   : 1000    Min.   :1   Min.   : 6.00   Min.   : 0.00  
##  1st Qu.:  3489   1st Qu.: 6000    1st Qu.:1   1st Qu.:11.12   1st Qu.: 9.13  
##  Median :  5000   Median :10612    Median :1   Median :13.48   Median :14.24  
##  Mean   :  5841   Mean   :12578    Mean   :1   Mean   :13.80   Mean   :13.99  
##  3rd Qu.:  7000   3rd Qu.:17500    3rd Qu.:1   3rd Qu.:15.96   3rd Qu.:19.16  
##  Max.   :500000   Max.   :35000    Max.   :1   Max.   :24.59   Max.   :29.95  
##  OpenCREDITLines  TotalCREDITLines RevolvingCREDITBalance
##  Min.   : 2.000   Min.   : 3.00    Min.   :     0        
##  1st Qu.: 6.000   1st Qu.:13.00    1st Qu.:  4447        
##  Median : 9.000   Median :20.00    Median :  9878        
##  Mean   : 9.275   Mean   :21.84    Mean   : 14033        
##  3rd Qu.:12.000   3rd Qu.:29.00    3rd Qu.: 18052        
##  Max.   :39.000   Max.   :90.00    Max.   :149588        
##  RevolvingLineUtilization  Inquiries6M       AccountsDQ DelinquentAmount
##  Min.   : 0.00            Min.   :0.0000   Min.   :0    Min.   :0       
##  1st Qu.:37.20            1st Qu.:0.0000   1st Qu.:0    1st Qu.:0       
##  Median :59.60            Median :1.0000   Median :0    Median :0       
##  Mean   :56.67            Mean   :0.9136   Mean   :0    Mean   :0       
##  3rd Qu.:78.90            3rd Qu.:2.0000   3rd Qu.:0    3rd Qu.:0       
##  Max.   :99.90            Max.   :8.0000   Max.   :0    Max.   :0       
##      DQ2yrs          PublicRec       currentpolicy      TERM      
##  Min.   : 0.0000   Min.   :0.00000   Min.   :1     Min.   :36.00  
##  1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.:1     1st Qu.:36.00  
##  Median : 0.0000   Median :0.00000   Median :1     Median :36.00  
##  Mean   : 0.1769   Mean   :0.06649   Mean   :1     Mean   :45.74  
##  3rd Qu.: 0.0000   3rd Qu.:0.00000   3rd Qu.:1     3rd Qu.:60.00  
##  Max.   :11.0000   Max.   :4.00000   Max.   :1     Max.   :60.00  
##      NoYear     
##  Min.   : 3.00  
##  1st Qu.: 9.00  
##  Median :12.00  
##  Mean   :13.56  
##  3rd Qu.:17.00  
##  Max.   :65.00

The summary statistics on the numeric variables shows that no customer has a delinquent amount which correlates with the account delinquent as there is none for the customers. It appears there is zero variance in MOB and current policy. It appears all the customers have been only a month long since the records were taken. It also appears the same credit policy is applied to all the customers. Some of the customers earn as low as 340 dollars and has high as 500,000 dollars monthly. However, the monthly income of most of the customers is 5841 dollars. The lowest amount of loan requested is 1000 dollars and the highest is 35,000 dollars. Interest rate can go as low as 6% and as high as 24.59%.

numeric_vars<-numeric_vars[,-c(3,11:12,15:16)]#Dropping MOB, AccountDQ, DelinquentAmount and currentpolicy

plotHist <- function(columns,bin,colours){
  par(mfrow = c(3,4))#Histogram plots to visualize the distribution of the numeric variables in the data set.
  for (i in columns) {
    hist(numeric_vars[,i], main = paste("Histogram of ", names(numeric_vars)[i]),
         nclass = bin, las = 1, col = colours, 
         xlab = paste(names(numeric_vars)[i]))
  }
}
plotHist(c(1:12), rep(5,12), "turquoise2")

Figure 1: Histogram on the numeric variables

From figure 1, the distribution of the variable Amount.Requested, DTI and InterestRate, TotalCREDITLines and RevolvinggLineUtilization appears to be normally distributed. Most of the variables are not normally distributed. About 17,587 (66%) of the customers have 10 or less open credit lines whereas about 723 (2.7%) have 20 more open credit lines. The number of months for payment of a particular loan among the customers is usually 36 or 60 months with 60 months being the most common.

df_loan %>%
  select(MonthlyIncome,Amount.Requested,InterestRate,DTI,OpenCREDITLines,
         TotalCREDITLines,RevolvingCREDITBalance,RevolvingLineUtilization,
         Inquiries6M,TERM,NoYear,target) %>% 
  mutate(target = recode_factor(target,
                                `0` = "Good", 
                                `1` = "Bad" )) %>%
  gather(key   = "key", 
         value = "value",-target) %>%
  
  #Distribution of the target variable (0=Good,1=Bad) among the numeric variables
  #Since no customer is indebted or has any standing delinquent amount, this variable together with AccountsDQ were removed from the data
  ggplot(aes(y = value)) +
  geom_boxplot(aes(fill = target),
               alpha  = .6,
               fatten = .7) + 
  labs(x = "",
       y = "",
       title = "Boxplots of Numeric Variables") +
  scale_fill_manual(
    values = c("turquoise2","turquoise4"),
    name   = "Target\nBad",
    labels = c("Good", "Bad")) +
  theme_bw() +
  facet_wrap(~ key, 
             scales = "free", 
             ncol   = 4)

Figure 2: Boxplot of numeric variables

In the figure above, it is observed that the proportion of the amount of loan requested by the two category of customers (Bad =“1” and Good =“0”) is equal. However, interest rate varies among these two groups, with that of Bad loan customers (customers likely to default) being higher than of Good loan customers (customers with low chance of defaulting). Revolving credit balance divided by total revolving credit limit (RevolvingLineUtilization) varies among Bad and Good loan customers with that of bad loan customers being slightly high. Number of months of loan payment for a loan do not affect whether a customer will default or not as the distribution is equal among the two groups of customers.

Bad<-df_loan[,-c(2,13:14,19)] %>% #Dropping MOB, AccountDQ, DelinquentAmount and currentpolicy
  filter(target==1)

Good<-df_loan[,-c(2,13:14,19)] %>%
  filter(target==0)


counts<-function(x,y,z){
  n=nrow(x)
  B_acc=nrow(y)
  G_acc=nrow(Good)
  P_B_acc=(B_acc/n)*100
  P_G_acc=((G_acc)/n)*100
  return(list(num_bad_account=B_acc,num_good_account=G_acc,
              percent_bad_account=P_B_acc,percent_good_account=P_G_acc))
}
counts(df_loan,Bad,Good)

## $num_bad_account
## [1] 2052
## 
## $num_good_account
## [1] 24510
## 
## $percent_bad_account
## [1] 7.725322
## 
## $percent_good_account
## [1] 92.27468

The total number of Bad loan and Good loan recorded is 2052 and 24510 representing approximately 7.7 and 92.3, respectively.

Bad %>% #Computing the summary statistics of amount requested, monthly income and the number of years a customer has been loaned among the bad accounts
  select(Amount.Requested,MonthlyIncome,InterestRate,NoYear) %>%
  summary()

##  Amount.Requested MonthlyIncome    InterestRate       NoYear     
##  Min.   : 1000    Min.   :  500   Min.   : 6.00   Min.   : 3.00  
##  1st Qu.: 5844    1st Qu.: 3082   1st Qu.:12.42   1st Qu.: 9.00  
##  Median :10800    Median : 4400   Median :14.65   Median :12.00  
##  Mean   :12607    Mean   : 5106   Mean   :14.91   Mean   :13.06  
##  3rd Qu.:18000    3rd Qu.: 6118   3rd Qu.:17.08   3rd Qu.:16.00  
##  Max.   :35000    Max.   :70000   Max.   :24.40   Max.   :48.00

Good %>% #Computing the summary statistics of amount requested, monthly income and the number of years a customer has been loaned among the good accounts
  select(Amount.Requested,MonthlyIncome,InterestRate,NoYear) %>%
  summary()

##  Amount.Requested MonthlyIncome     InterestRate       NoYear    
##  Min.   : 1000    Min.   :   340   Min.   : 6.00   Min.   : 3.0  
##  1st Qu.: 6000    1st Qu.:  3500   1st Qu.:11.12   1st Qu.: 9.0  
##  Median :10600    Median :  5000   Median :13.35   Median :12.0  
##  Mean   :12575    Mean   :  5903   Mean   :13.71   Mean   :13.6  
##  3rd Qu.:17500    3rd Qu.:  7083   3rd Qu.:15.70   3rd Qu.:17.0  
##  Max.   :35000    Max.   :500000   Max.   :24.59   Max.   :65.0

After computing summary statistics of amount of money requested, monthly income, interest rate and number of years on both Bad and Good loan customers, it appears the minimum and maximum amount of money requested by both groups is approximately equal with most of the amount requested centered around 12,500 dollar (fig. 5 & 6). The minimum monthly income observed among Bad loan customers is 500 dollars with 70,000 dollars being the maximum. This differs when compared to the Good loan customers with minimum of 340 dollars and maximum amount of 500,000 dollars. A plot with the frequency of customers with monthly income greater or equal to 20,000 dollars: figure 9 shows that among the two groups, majority of the Good loan customers have monthly income greater than 20,000 with most of them earning 25,000 monthly. Most of the Good customers have monthly income of approximately 5,900 which exceeds that of the Bad customers with approximately 800 dollars. Interest rate on loans requested appears to be approximately equal among the two groups with that of Bad loan customers being slightly high. It appears Good loan customers have handled money with the LendingClub longer than the Bad loan customers. The number of years customers in both group have been employed at their current jobs appears to be equal (fig. 3 & 4).

#Bar plot on Employment Length
ggplot(Bad, aes(x =EmploymentLength)) + 
  geom_bar(color = "white", fill = "#A4AA83", binwidth = .5) + 
  theme_minimal() + 
  labs(title = "Barplot for Employment Length on Bad Loan Customers", x = "Employment Length", 
       y = "Frequency")

Figure 3: Bar plot on Employment Length

ggplot(Good, aes(x =EmploymentLength)) + 
  geom_bar(color = "white", fill = "#A4AA83", binwidth = .5) + 
  theme_minimal() + 
  labs(title = "Barplot for Employment Length on Good Loan Customers", x = "Employment Length", 
       y = "Frequency")

Figure 4: Bar plot on Employment Length

#Histogram on Amount of Money Requested
ggplot(Bad, aes(x =Amount.Requested)) + 
  geom_histogram(color = "white", fill = "#A4AA83") + 
  theme_minimal() + 
  labs(title = "Histogram for Amount Requested by Bad Loan Customers", x = "Amount of Money Requested", 
       y = "Frequency") + xlim(0,35000)

Figure 5: Histogram of amount of money requested by group

ggplot(Good, aes(x =Amount.Requested)) + 
  geom_histogram(color = "white", fill = "#A4AA83") + 
  theme_minimal() + 
  labs(title = "Histogram for Amount Requested by Good Loan Customers", x = "Amount of Money Requested", 
       y = "Frequency") + xlim(0,35000)

Figure 6: Histogram of amount of money requested by group

#Histogram on Amount of Monthly Income
ggplot(Bad, aes(x = MonthlyIncome)) + 
  geom_histogram(color = "white", fill = "#A4AA83") + 
  theme_minimal() + 
  labs(title = "Histogram for Monthly Income on Bad Loan Customers", x = "Monthly Income", 
       y = "Frequency") + xlim(0,20000)

Figure 7: Histogram on monthly income by group

ggplot(Good, aes(x = MonthlyIncome)) + 
  geom_histogram(color = "white", fill = "#A4AA83") + 
  theme_minimal() + 
  labs(title = "Histogram for Monthly Income on Good Loan Customers", x = "Monthly Income", 
       y = "Frequency") + 
  xlim(0, 20000)

Figure 8: Histogram on monthly income by group

A<-Bad %>% 
  filter(MonthlyIncome >=20000) %>%
  ggplot(aes(MonthlyIncome)) + geom_histogram(color="white",fill="#A4AA83") + 
  theme_minimal() + labs(title = "Monthly Income >= 20000 on Bad Account",
                         x="Monthly Income >= 20000",y="Frequency") + xlim(20000,70000)

B<-Good %>% 
  filter(MonthlyIncome >=20000) %>%
  ggplot(aes(MonthlyIncome)) + geom_histogram(color="white",fill="#A4AA83") + 
  theme_minimal() + labs(title = "Monthly Income >= 20000 on Good Account",
                         x="Monthly Income >= 20000",y="Frequency") + xlim(20000,70000)
gridExtra::grid.arrange(A,B,ncol=2)

Figure 9: Monthly income >= 20000 by group

ggplot(df_loan, aes(y = HomeOwnership, 
                       LoanPurpose)) + geom_count(color = "brown") + theme(axis.text.x = element_text(angle = 90, 
                                                                                                      hjust = 1))

Figure 10: A plot of Number of Loans by HomeOwnership and Loan Purpose, and Interest Rate vs Loan Purpose

ggplot(df_loan, aes(y = InterestRate, 
                       LoanPurpose,fill=LoanPurpose)) + geom_violin(trim = FALSE) + 
  #A plot showing distribution of Amount of loan requested for the different levels in LoanPurpose
  geom_boxplot(width = 0.1) + theme(axis.text.x = element_text(angle = 90, 
                                                               hjust = 1))

Figure 11: A plot of Number of Loans by HomeOwnership and Loan Purpose, and Interest Rate vs Loan Purpose

Figure 11 shows the number of loans grouped by Home Ownership and the purpose of the loan. Debt Consolidation is the most common purpose of loans. But the proportion is different among different types of home ownership. Most mortgage and rent owners have the highest number of loans for Debt Consolidation and credit cards. Individuals with own homes do not really apply for either of the loan purpose given. Interest rate on loans requested for cars, credit card, debt consolidation and education are significantly higher compared to the other loan purpose with that of cars being highest. However, the interest rate of loans on cars and education are significantly low compared to small business which is not often requested.

ggplot(df_loan, aes(x =LoanPurpose, y = Amount.Requested, fill = LoanPurpose)) + 
  geom_violin(trim = FALSE) + #A plot showing distribution of Amount of loan requested for the different levels in LoanPurpose
  geom_boxplot(width = 0.1)

Figure 12: A plot of Amount of Loan Requested grouped by Loan Purpose

sta.income <-df_loan%>% mutate(Fico_Score=ifelse(appl_fico_band<740,"Good",
                                 ifelse(appl_fico_band>800,"Excellent","Very Good")),
  
                                IncomeType = factor(ifelse(MonthlyIncome < 4000, "Under4000",  
                                                            ifelse(MonthlyIncome < 6000 & MonthlyIncome >= 4000, "4000-6000",
                                                                   ifelse(MonthlyIncome < 8000 & MonthlyIncome >= 6000, "6000-8000",
                                                                          ifelse(MonthlyIncome < 10000 & MonthlyIncome >= 8000, "8000-10000", 
                                                                                 "Above10000"))))),
                              
                               )

ggplot(sta.income, aes(Fico_Score,InterestRate,fill=Fico_Score))+geom_boxplot(outlier.color = "blue")+
  theme()+labs(title="Box plot of Interest rate vs Fico_Score")

Figure 13: Box plot of Interest rate vs Fico_Score

Figure 13 shows that as credit score of a customer increases, the interest rate on a loan requested by the customer becomes deceases which makes sense as these loans would most likely be less risky. This is also observed in figure 14 . Interest rate charged on amount of loan requested by a customer is largely dependent on the credit score of that customer. Customers with excellent credit score requesting significantly high amount of loans turns to have low interest rate compared to customers with good credit score but requesting low amount. In respective of the income type, customers with excellent credit score requests significantly high amount of loans compared to other groups: 15 .There is no significant association between amount of loan requested and interest rate: figure 16 .

ggplot(sta.income, aes(Amount.Requested,InterestRate,fill=Fico_Score))+geom_boxplot(outlier.color = "blue")+ 
  theme()+labs(title="Box plot of Amount of Loan Requested vs Interest Rate")

Figure 14: Box plot of Amount of Loan Requested vs Interest Rate

ggplot(sta.income, aes(IncomeType,Amount.Requested,fill=Fico_Score))+geom_boxplot(outlier.color = "blue")+
  theme()+labs(title="Box plot of Amount of Loan Requested vs IncomeType")

Figure 15: Box plot of Amount of Loan Requested vs IncomeType

ggplot(sta.income, aes(Amount.Requested, InterestRate)) + geom_point() + geom_smooth() + xlab("Amount of Loan Requested") +
  ylab("InterestRate")+labs(title="Scatterplot of Interest Rate Vs. Inquiries")

Figure 16: Scatterplot of Interest Rate Vs. Inquiries

ggplot(sta.income, aes(x = State, y = Amount.Requested)) +
  geom_point(size=3, color="turquoise4") + 
  geom_segment(aes(x = State, xend = State, y = 0, yend = Amount.Requested)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+labs(title="Loppipop plot of Amount of Loan Requested in each State")

Figure 17: Loppipop plot of Amount of Loan Requested in each State

Figure 17 shows the amount of money requested in each state. It is observed that the amount of money requested in California, Florida, New York, Texas and Maryland are significantly high compared to other states. Amount of money requested is significantly low in Mississippi which is evident by their low population and low standard of living. States with low standard of living such as South Dakota, Arkansas, Vermont, among others request low amount of money.

df_loan<-df_loan %>%
  mutate(State=ifelse(State %in% c("CT","MA","MD","NH","NJ","NY","PA","RI","VT"),"NorthEast",
                      ifelse(State %in% c("MS","AL","GA","LA","AR","FL","SC","NC","KY","VA","WV","DE","DC"),"SouthEast",
                             ifelse(State %in% c("CA","WA","OR","MT","WY","NV","UT","CO","HI","AK"),"West",
                                    ifelse(State %in% c("SD","KS","MN","MO","IL","OH","MI","WI"),"MidWest","SouthWest")))),
         
          EmploymentLength=recode_factor(EmploymentLength,                    
                                        `< 1 year` = "1",`1 year` = "1", `2 years` = "2",`3 years` = "3",
                                        `4 years` = "4",`5 years` = "5",`6 years` = "6",`6 years` = "6",
                                        `7 years` = "7",`8 years` = "8",`9 years` = "9",`n/a` = "0",
                                        `10+ years` = "10" ),
         
         appl_fico_band=ifelse(appl_fico_band<740,"Good",
                                 ifelse(appl_fico_band>800,"Excellent","Very Good"))
  ) %>%
  rename(
    Region=State
  )

factor_vars<-df_loan%>% #Generating the distribution of the target (0=Good, 1=Bad) among the categorical variables
  select(HomeOwnership,appl_fico_band,TERM,Region,target) %>%
  mutate(target = recode_factor(target, 
                                `0` = "Good", 
                                `1` = "Bad" ),
         TERM=recode_factor(TERM,
                            `36` = "Short Term", 
                            `60` = "Long Term" )) %>%
  gather(key = "key", value = "value",-target) 

factor_vars%>% 
  ggplot(aes(value)) +
  geom_bar(aes(x    = value, 
               fill  = target), 
           alpha    = .6, 
           position = "dodge",
           backgroundColor="white",
           color    = "black",
           width    = .8
  ) +
  labs(x = "",
       y = "",
       title = "Barplot of Categorical Variables vs Target (Bad or Good)") +
  theme(
    axis.text.y  = element_blank(),
    axis.ticks.y = element_blank()) +
  facet_wrap(~ key, scales = "free", nrow = 3) +
  scale_fill_manual(
    values = c("turquoise4", "turquoise2"),
    name   = "Target\nBad",
    labels = c("Good", "Bad"))

Figure 18: Barplot of Categorical variables vs Target (Bad or Good)

In figure 18, only few of both good and bad loan customers have excellent credit score. Majority of the good customers have good credit score compared to the bad customers and few have very good credit score which is significantly higher than the bad customers. Grouping by home ownership shows that most of the customers that rent or are on mortgage do not default in their loan payment. Among customers who own their own home, most of them do not default in their loan payment. Number of months payments for a loan do not show whether a customer will default or not in their loan payment. Among both long and shot payment, the number of good customers is significantly high compared to bad customers. Even though figure 19 shows the amount of money requested by customers in the 5 regions are approximately equal, figure 18 shows that the proportion of good customers to bad customers taking into account the total number of borrowers from each of the region is approximately

ggplot(df_loan, aes(y = Amount.Requested, 
                       Region,fill=Region)) + geom_violin(trim = FALSE) + 
  geom_boxplot(width = 0.1) + theme(axis.text.x = element_text(angle = 90, 
                                                               hjust = 1))

Figure 19: A plot showing distribution of Amount of Money Requested in Regions

df_loan$EmploymentLength<-as.numeric(df_loan$EmploymentLength)
df_loan<-df_loan[,-c(2:3,13:14,19,21,23)]

df<-df_loan[,c(2,5:11,13:14,18)]
corr<-round(cor(df),1)
ggcorrplot::ggcorrplot(corr, hc.order = TRUE, outline.col = "white")

Figure 20: Correlation matrix of the numeric variables

# using hierarchical clustering for computing correlation among the numeric predictors.

df_loan<-df_loan[-6]#Removing number of opening credit lines from the data since it's highly correlated with nuumber of total credit lines
df_loan[, c(3:4,14)]<-lapply(df_loan[,c(3:4,14)],as.factor)#Converting the re-coded variables into their original forms
levels(df_loan$HomeOwnership)[2]<-"RENT" #There are only 2 values for the level OTHER, hence, converted it to RENT

MODEL BUILDING

Building Multivariate Adaptive Regression Splines (MARS) Model

Feature selection:

The nature of MARS features breaks predictor variables into two distinct groups and models linear relationships between the predictor and the outcome in each group. Due to its robustness in terms of handling input variables, data used for MARS models needs little pre-processing such that transformation and filtering of input variables are not necessary. Even though discrepancies among input variables such as their strong correlation with one another, do not significantly affect the model’s performance, they can complicate the model’s interpretation. To circumvent this shortcoming, correlation matrix was computed among the numeric predictors. The correlation between TotalCREDITLines and OpenCREDITLines was a value greater than 0.7: figure 20. Hence, TotalCREDITLines was removed from the data. Total of 6 out of 17 predictors that entered the model were retained. This is because the model thinned the predictors and retain only those significant for the prediction.

Data Splitting:

The MARS model was built with the dataset containing the 18 input variables and the target variable, target. The data set was split randomly at 60% (15938 observations) as the train data and 40% (10624) as the test data. The MARS model was built on the train data using the earth package, and the model performance evaluated on both the train and test data.

set.seed(551)
target<-df_loan$target
predictors<-df_loan[,-c(1,15)]
trainRows<-caret::createDataPartition(target,p=.60,list = FALSE) #Randomly spliiting the data into 40% test data and 60% train data
trainPredictors<-predictors[trainRows,]
trainTarget<-target[trainRows]
testPredictors<-predictors[-trainRows,]
testTarget<-target[-trainRows]

marsfit<-earth::earth(trainPredictors,trainTarget,glm=list(family=binomial),degree=1)#Performing the MARS model using the earth package.
summary(marsfit) #Using the summary() to generate more extensive output from the model

## Call: earth(x=trainPredictors, y=trainTarget, glm=list(family=binomial),
##             degree=1)
## 
## GLM coefficients
##                                  trainTarget
## (Intercept)                       -3.2583129
## MonthsSinceDQ13                    1.0955130
## LoanPurposecredit_card            -0.4607224
## LoanPurposesmall_business          0.9032795
## h(9750-MonthlyIncome)              0.0001490
## h(15.2-RevolvingLineUtilization)   0.0518046
## h(RevolvingLineUtilization-15.2)   0.0124722
## h(3-Inquiries6M)                  -0.2563617
## 
## GLM (family binomial, link logit):
##  nulldev    df       dev    df   devratio     AIC iters converged
##   8817.3 15937   8465.82 15930     0.0399    8482     5         1
## 
## Earth selected 8 of 10 terms, and 6 of 119 predictors
## Termination condition: RSq changed by less than 0.001 at 10 terms
## Importance: MonthlyIncome, LoanPurposesmall_business, Inquiries6M, ...
## Number of terms at each degree of interaction: 1 7 (additive model)
## Earth GCV 0.07134819    RSS 1135.008    GRSq 0.02086246    RSq 0.02258197

Summary statistics of MARS model can be difficult to explain. A negative coefficient does not mean a negative association between the target and the predictor. Customers with 13 or more months of delinquency since the last record was taken have higher chance of defaulting on their loan. Also, customers that request loan for credit card purpose stands low chance of defaulting. However, customers with a request for their small business have a high chance of defaulting. customers with monthly income less than 9750 dollars have low chance of defaulting. In terms of revolving line utilization, customers with more than 15.2 revolving lines have low chance of defaulting compared to customers with less than 15.2 revolving lines. If a customer has made less than 3 inquiries for new credit in the last 6 months, there is significantly low chance of the customer defaulting.

kableExtra::kable(caret::varImp(marsfit),caption = "Table.1: Earth variable importance") %>%
  kableExtra::kable_classic(html_font="Times New Roman")#Predictors selected by the MARS model

Table 1: Table.1: Earth variable importance
	Overall
MonthlyIncome	100.00000
LoanPurposesmall_business	83.14410
Inquiries6M	69.49103
RevolvingLineUtilization	53.89279
LoanPurposecredit_card	28.84886
MonthsSinceDQ13	19.16641

Table 1 shows the predictors selected by the earth function during the model building as significant in the prediction. Their overall importance is scaled between 0 and 100.

marsTest_pred<-predict(marsfit,newdata=testPredictors,type="response")#Validating the model on the test data
testTarget_testPred<-data.frame(obs=testTarget,
                                pred=marsTest_pred)
testTarget_testPred %>% 
  rename(
    pred=trainTarget
  ) %>%
  caret::defaultSummary()

##       RMSE   Rsquared        MAE 
## 0.26080554 0.01458389 0.13838347

marsTrain_pred<-predict(marsfit,newdata=trainPredictors,type="response")#Validating the model on the train data
trainTarget_trainPred<-data.frame(obs=trainTarget,
                                  pred=marsTrain_pred)
trainTarget_trainPred %>% 
  rename(
    pred=trainTarget,
  ) %>%
  caret::defaultSummary()

##       RMSE   Rsquared        MAE 
## 0.26674862 0.02340963 0.14221774

The performance of the MARS model on both the train and test data was statistically significant as observed by its low RMSE value on both the train (0.26674862) and test data (0.26080554). This was also true when their MAE (0.14221774 and 0.13838347 for the train and test data, respectively) were computed. Even though its low Mean Absolute Error rate do not tell the model’s performance, it suggests that the MARS model is great at prediction. ROC curves for both the train and test data together with the area under each curve (auc) were computed. As shown in figure 21, the model’s performance rate on both the train and test data appears statistically significant as the value of the Area under the ROC curve (0.65 and 0.62 for the train and test data, respectively) is close to 1. Their computed AUROC curves are comparable, indicating the model do not overfit the test data.

df_marsTrain_pred<-tibble(trainTarget,marsTrain_pred)#Generating the scoring data for the MARS model on the train data
sort_df_marsTrain_pred<-(df_marsTrain_pred[order(df_marsTrain_pred[,2],decreasing = T),])
#DT::datatable(df_marsTrain_pred,filter = "top",caption="Table.2: Predicted verses Actual values in the train Data (MARS Model)")

df_marsTest_pred<-tibble(testTarget,marsTest_pred)#Generating the scoring data for the logistic model on the test data
sort_mars_Test_pred<-(df_marsTest_pred[order(df_marsTest_pred[,2],decreasing = T),])
#DT::datatable(sort_mars_target_pred,filter = "top",caption="Table.3: Predicted verses Actual values in the test Data (Logistic Model)")
#write.csv(sort_mars_target_pred,"MARS_Scoring.csv")

if(require(ROCR)){trainpred<-prediction(sort_df_marsTrain_pred$marsTrain_pred,sort_df_marsTrain_pred$trainTarget)}
if(require('ROCR')){trainperf<-performance(trainpred,"tpr","fpr")}                    #Generate a plot of True Positive Rate vs False Positive Rate from the logistic model
marsTrain_auc<-performance(trainpred, measure = "auc")   #Computes the area under the ROC curve     
print(paste(marsTrain_auc@y.values,"AUC ON TRAIN DATA"))

## [1] "0.651595090410237 AUC ON TRAIN DATA"

if(require(ROCR)){testpred<-prediction(sort_mars_Test_pred$marsTest_pred,sort_mars_Test_pred$testTarget)}
if(require('ROCR')){testperf<-performance(testpred,"tpr","fpr")}                    #Generate a plot of True Positive Rate vs False Positive Rate from the logistic model
mars_test_auc<-performance(testpred, measure = "auc")   #Computes the area under the ROC curve     
print(paste(mars_test_auc@y.values,"AUC ON TEST DATA"))

## [1] "0.618229296508351 AUC ON TEST DATA"

plot(unlist(slot(trainperf,"x.values")),unlist(slot(trainperf,"y.values")),ylab="True Positive Rate",xlab="False Positive Rate",type="l",col="green",
     main="Receiver Operating Characteristic (ROC) Plot on the MARS Model")
points(unlist(slot(testperf,"x.values")),unlist(slot(testperf,"y.values")),ylab="True Positive Rate",xlab="False Positive Rate",type="l",col="blue")
abline(a=0,b=1,col="red")
legend("topleft", legend=c("TRAIN_AUC","Test_AUC","Baseline_AUC"),
       col=c("green","blue","red"), lty=1:3, cex=0.6)
legend("bottomright", legend=c("Train_AUC=0.65","Test_AUC=0.62"),
       col=c("green","blue"), lty=1:2, cex=0.6)

Figure 21: Receiver Operating Characteristic (ROC) plot on the MARS model

if(require(gains)){Gains<-gains(testTarget,marsTest_pred)}#Computing gains for the model prediction on the test data
Gains

## Depth                            Cume   Cume Pct                     Mean
##  of           Cume     Mean      Mean   of Total    Lift   Cume     Model
## File     N      N      Resp      Resp      Resp    Index   Lift     Score
## -------------------------------------------------------------------------
##   10  1062   1062      0.14      0.14      19.3%     193    193      0.17
##   20  1062   2124      0.10      0.12      32.7%     134    164      0.12
##   30  1063   3187      0.10      0.11      45.8%     130    153      0.10
##   40  1062   4249      0.07      0.10      55.6%      99    139      0.08
##   50  1063   5312      0.07      0.10      65.5%      99    131      0.07
##   60  1062   6374      0.06      0.09      73.1%      76    122      0.07
##   70  1062   7436      0.06      0.09      81.0%      80    116      0.06
##   80  1063   8499      0.06      0.08      88.6%      76    111      0.05
##   90  1062   9561      0.04      0.08      94.1%      54    105      0.04
##  100  1063  10624      0.04      0.07     100.0%      59    100      0.03

From figure 22 and the gains table above, it is observed that depth of file of 50 which has cumulative observation of 5312 out of the 10624 has significant cumulative percentage of 65.5% out of the total response. This implies that in case of predicting customers with the probability of having bad loans or defaulting, if we target the first 50% of the customers (5312) from the predictions made by the model, 65.5% percent of them will default on their loan payment. Thus, approximately 3,480 out of the 5312 customers will default on their loans.

par(mfrow=c(1,3))#A plot of the Mean Response and the Deciles 
plot(Gains,ylim = c(0,0.1),xlim=c(0,100))

barplot(Gains$mean.resp, names.arg = Gains$depth, xlab = "Percentile", #A barplot of the Means Response and the Deciles
        ylab = "Mean Response", main = "Decile-wise lift chart")
barplot(Gains$cume.pct.of.total, names.arg = Gains$depth, xlab = "Percentile", #A barplot of the cummulative percentage of the total response vs the deciles
        ylab = "Cum(%_Total_Response)", main = "Decile-wise lift chart")#A barplot of the cummulative percentage of the total response vs the deciles

Figure 22: Gains plot and barplot of Mean Response, and Cumulative Percent of Total Response on Depth of File from the MARS Model

Building Logistic Model

step_model=lm(target ~ .,data=df_loan[,-1]) 
Stepwise_model=step(step_model,direction="both",test="F")#Stepwise selection was performed on the variables to select those that appear significant at 0.5 significant level

## Start:  AIC=-70496.17
## target ~ DTI + Region + HomeOwnership + MonthlyIncome + TotalCREDITLines + 
##     RevolvingCREDITBalance + RevolvingLineUtilization + Inquiries6M + 
##     DQ2yrs + MonthsSinceDQ + PublicRec + EmploymentLength + appl_fico_band + 
##     LoanPurpose + NoYear
## 
##                            Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## - MonthsSinceDQ            93    6.4525 1858.0 -70590   0.9906  0.506091    
## - TotalCREDITLines          1    0.0046 1851.6 -70498   0.0657  0.797662    
## - RevolvingCREDITBalance    1    0.0219 1851.6 -70498   0.3132  0.575719    
## - EmploymentLength          1    0.0447 1851.6 -70498   0.6381  0.424413    
## - DQ2yrs                    1    0.0736 1851.7 -70497   1.0511  0.305256    
## - DTI                       1    0.0888 1851.7 -70497   1.2676  0.260222    
## - appl_fico_band            2    0.2614 1851.8 -70496   1.8661  0.154739    
## <none>                                  1851.6 -70496                       
## - NoYear                    1    0.1713 1851.8 -70496   2.4458  0.117852    
## - Region                    4    1.3598 1853.0 -70485   4.8539  0.000653 ***
## - PublicRec                 1    1.1066 1852.7 -70482  15.8006 7.057e-05 ***
## - HomeOwnership             2    1.5197 1853.1 -70478  10.8489 1.951e-05 ***
## - MonthlyIncome             1    1.7061 1853.3 -70474  24.3602 8.039e-07 ***
## - RevolvingLineUtilization  1    2.7237 1854.3 -70459  38.8890 4.554e-10 ***
## - Inquiries6M               1    8.3918 1860.0 -70378 119.8184 < 2.2e-16 ***
## - LoanPurpose              13   15.6703 1867.3 -70298  17.2108 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-70589.77
## target ~ DTI + Region + HomeOwnership + MonthlyIncome + TotalCREDITLines + 
##     RevolvingCREDITBalance + RevolvingLineUtilization + Inquiries6M + 
##     DQ2yrs + PublicRec + EmploymentLength + appl_fico_band + 
##     LoanPurpose + NoYear
## 
##                            Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## - TotalCREDITLines          1    0.0208 1858.1 -70591   0.2968 0.5859086    
## - EmploymentLength          1    0.0315 1858.1 -70591   0.4500 0.5023201    
## - RevolvingCREDITBalance    1    0.0434 1858.1 -70591   0.6198 0.4311121    
## - appl_fico_band            2    0.2097 1858.2 -70591   1.4973 0.2237446    
## - DTI                       1    0.1366 1858.2 -70590   1.9506 0.1625306    
## <none>                                  1858.0 -70590                       
## - NoYear                    1    0.2359 1858.3 -70588   3.3679 0.0664888 .  
## - DQ2yrs                    1    0.8047 1858.8 -70580  11.4898 0.0007008 ***
## - Region                    4    1.3622 1859.4 -70578   4.8625 0.0006429 ***
## - PublicRec                 1    1.1381 1859.2 -70576  16.2502 5.566e-05 ***
## - HomeOwnership             2    1.4749 1859.5 -70573  10.5299 2.684e-05 ***
## - MonthlyIncome             1    1.6284 1859.7 -70569  23.2503 1.430e-06 ***
## - RevolvingLineUtilization  1    2.8170 1860.9 -70552  40.2228 2.302e-10 ***
## + MonthsSinceDQ            93    6.4525 1851.6 -70496   0.9906 0.5060910    
## - Inquiries6M               1    8.6131 1866.7 -70469 122.9822 < 2.2e-16 ***
## - LoanPurpose              13   15.6729 1873.7 -70393  17.2142 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-70591.47
## target ~ DTI + Region + HomeOwnership + MonthlyIncome + RevolvingCREDITBalance + 
##     RevolvingLineUtilization + Inquiries6M + DQ2yrs + PublicRec + 
##     EmploymentLength + appl_fico_band + LoanPurpose + NoYear
## 
##                            Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## - EmploymentLength          1    0.0299 1858.1 -70593   0.4271 0.5134082    
## - RevolvingCREDITBalance    1    0.0352 1858.1 -70593   0.5026 0.4783493    
## - appl_fico_band            2    0.2103 1858.3 -70592   1.5013 0.2228486    
## - DTI                       1    0.1183 1858.2 -70592   1.6892 0.1937227    
## <none>                                  1858.1 -70591                       
## + TotalCREDITLines          1    0.0208 1858.0 -70590   0.2968 0.5859086    
## - NoYear                    1    0.2929 1858.4 -70589   4.1826 0.0408502 *  
## - DQ2yrs                    1    0.7892 1858.9 -70582  11.2694 0.0007891 ***
## - Region                    4    1.3798 1859.5 -70580   4.9254 0.0005736 ***
## - PublicRec                 1    1.1449 1859.2 -70577  16.3478 5.286e-05 ***
## - HomeOwnership             2    1.5686 1859.6 -70573  11.1988 1.375e-05 ***
## - MonthlyIncome             1    1.7260 1859.8 -70569  24.6457 6.933e-07 ***
## - RevolvingLineUtilization  1    2.9887 1861.0 -70551  42.6756 6.578e-11 ***
## + MonthsSinceDQ            93    6.4687 1851.6 -70498   0.9932 0.4992160    
## - Inquiries6M               1    8.6255 1866.7 -70470 123.1626 < 2.2e-16 ***
## - LoanPurpose              13   15.7240 1873.8 -70394  17.2708 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-70593.05
## target ~ DTI + Region + HomeOwnership + MonthlyIncome + RevolvingCREDITBalance + 
##     RevolvingLineUtilization + Inquiries6M + DQ2yrs + PublicRec + 
##     appl_fico_band + LoanPurpose + NoYear
## 
##                            Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## - RevolvingCREDITBalance    1    0.0375 1858.1 -70595   0.5358 0.4641817    
## - appl_fico_band            2    0.2091 1858.3 -70594   1.4928 0.2247554    
## - DTI                       1    0.1209 1858.2 -70593   1.7257 0.1889764    
## <none>                                  1858.1 -70593                       
## + EmploymentLength          1    0.0299 1858.1 -70591   0.4271 0.5134082    
## + TotalCREDITLines          1    0.0192 1858.1 -70591   0.2739 0.6007562    
## - NoYear                    1    0.2654 1858.4 -70591   3.7895 0.0515871 .  
## - DQ2yrs                    1    0.7893 1858.9 -70584  11.2712 0.0007883 ***
## - Region                    4    1.3820 1859.5 -70581   4.9335 0.0005653 ***
## - PublicRec                 1    1.1739 1859.3 -70578  16.7623 4.249e-05 ***
## - HomeOwnership             2    1.5414 1859.6 -70575  11.0049 1.670e-05 ***
## - MonthlyIncome             1    1.7222 1859.8 -70570  24.5922 7.128e-07 ***
## - RevolvingLineUtilization  1    2.9947 1861.1 -70552  42.7620 6.294e-11 ***
## + MonthsSinceDQ            93    6.4548 1851.6 -70499   0.9910 0.5049997    
## - Inquiries6M               1    8.6280 1866.7 -70472 123.2007 < 2.2e-16 ***
## - LoanPurpose              13   15.7306 1873.8 -70395  17.2784 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-70594.51
## target ~ DTI + Region + HomeOwnership + MonthlyIncome + RevolvingLineUtilization + 
##     Inquiries6M + DQ2yrs + PublicRec + appl_fico_band + LoanPurpose + 
##     NoYear
## 
##                            Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## - appl_fico_band            2    0.2007 1858.3 -70596   1.4331 0.2385937    
## <none>                                  1858.1 -70595                       
## - DTI                       1    0.1558 1858.3 -70594   2.2241 0.1358835    
## - NoYear                    1    0.2346 1858.4 -70593   3.3504 0.0672011 .  
## + RevolvingCREDITBalance    1    0.0375 1858.1 -70593   0.5358 0.4641817    
## + EmploymentLength          1    0.0322 1858.1 -70593   0.4603 0.4974932    
## + TotalCREDITLines          1    0.0111 1858.1 -70593   0.1584 0.6906446    
## - DQ2yrs                    1    0.7701 1858.9 -70586  10.9971 0.0009138 ***
## - Region                    4    1.3810 1859.5 -70583   4.9300 0.0005688 ***
## - PublicRec                 1    1.1429 1859.3 -70580  16.3193 5.367e-05 ***
## - HomeOwnership             2    1.5051 1859.6 -70577  10.7460 2.163e-05 ***
## - MonthlyIncome             1    1.7076 1859.8 -70572  24.3840 7.940e-07 ***
## - RevolvingLineUtilization  1    3.3391 1861.5 -70549  47.6800 5.131e-12 ***
## + MonthsSinceDQ            93    6.4711 1851.7 -70501   0.9936 0.4981121    
## - Inquiries6M               1    8.5980 1866.7 -70474 122.7741 < 2.2e-16 ***
## - LoanPurpose              13   15.7025 1873.8 -70397  17.2478 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-70595.64
## target ~ DTI + Region + HomeOwnership + MonthlyIncome + RevolvingLineUtilization + 
##     Inquiries6M + DQ2yrs + PublicRec + LoanPurpose + NoYear
## 
##                            Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## <none>                                  1858.3 -70596                       
## - DTI                       1    0.1560 1858.5 -70595   2.2269 0.1356385    
## + appl_fico_band            2    0.2007 1858.1 -70595   1.4331 0.2385937    
## + EmploymentLength          1    0.0308 1858.3 -70594   0.4393 0.5074753    
## + RevolvingCREDITBalance    1    0.0292 1858.3 -70594   0.4162 0.5188299    
## + TotalCREDITLines          1    0.0124 1858.3 -70594   0.1764 0.6745051    
## - NoYear                    1    0.2761 1858.6 -70594   3.9425 0.0470902 *  
## - DQ2yrs                    1    0.8710 1859.2 -70585  12.4363 0.0004218 ***
## - Region                    4    1.3960 1859.7 -70584   4.9835 0.0005162 ***
## - PublicRec                 1    1.2161 1859.5 -70580  17.3646 3.095e-05 ***
## - HomeOwnership             2    1.5449 1859.9 -70578  11.0297 1.629e-05 ***
## - MonthlyIncome             1    1.7210 1860.1 -70573  24.5743 7.194e-07 ***
## - RevolvingLineUtilization  1    4.1234 1862.5 -70539  58.8772 1.736e-14 ***
## + MonthsSinceDQ            93    6.4157 1851.9 -70502   0.9850 0.5214999    
## - Inquiries6M               1    8.5317 1866.9 -70476 121.8230 < 2.2e-16 ***
## - LoanPurpose              13   15.6809 1874.0 -70398  17.2236 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Feature selection:

Stepwise selection was used to select significant predictors for the logistic regression. Out of the 17 predictors, 9 were selected as significant for the prediction at 0.05 significant level. However, after modeling the logistic regression with the 9 predictors, the summary statistics from the model showed that NoYear, DQ2yrs and HomeOwnership were insignificant in the prediction. These predictors were removed and the model built with the remaining 6 predictors. The numeric variables were binned with BinProfet () function in Rprofet package and smoothed or pruned with WOE_custom () function also part of Rprofet package.

df_subset<-select(df_loan,LOAN_ID,NoYear,DQ2yrs,Region,PublicRec,HomeOwnership,MonthlyIncome,RevolvingLineUtilization,Inquiries6M,LoanPurpose,target)
bin_vars<-Rprofet::BinProfet(dat=df_subset,'LOAN_ID',varcol = c(2:3,5,7,8,9),target = 'target')#Bucketing the predictors with the function BinProfet before model building

#Plots of binned predictors
bin_plots<-function(x,y){
  Rprofet::WOEplotter(dat=bin_vars,var = x,target=y) 
}
bin_plots('MonthlyIncome_Bins','target')#Generating plots for the binned predictors to check if there is the need to prune any before they enter the model

bin_plots('NoYear_Bins','target')

bin_plots('RevolvingLineUtilization_Bins','target')

bin_plots('DQ2yrs_Bins','target')

bin_plots('Inquiries6M_Bins','target')

bin_plots('PublicRec_Bins','target')

#Plots of the smoothed/pruned numeric predictors
df_subset$DQ2yrs<-Rprofet::WOE_custom(df_subset,'DQ2yrs', target='target', breaks=c(0,0.2,0.5,2.5,Inf), right_bracket = F, color = "#A4AA83")

df_subset$RevolvingLineUtilization<-Rprofet::WOE_custom(df_subset,'RevolvingLineUtilization', target='target', breaks=c(0,25,42,55,65,75,87,Inf), right_bracket = F, color = "#A4AA83")

df_subset$Inquiries6M<-Rprofet::WOE_custom(df_subset,'Inquiries6M', target='target', breaks=c(0,0.02,1,2,Inf), right_bracket = F, color = "#A4AA83")

df_subset$MonthlyIncome<-Rprofet::WOE_custom(df_subset,'MonthlyIncome', target='target', breaks=c(-Inf,3.2e+03,4.2e+03,5.5e+03,7.6e+03,Inf), right_bracket = F, color ="#A4AA83")

df_subset$NoYear<-Rprofet::WOE_custom(df_subset,'NoYear', target='target', breaks=c(-Inf,7,10,12,15,20,Inf), right_bracket = F, color = "#A4AA83")

df_subset$PublicRec<-Rprofet::WOE_custom(df_subset,'PublicRec', target='target', breaks=c(0,0.001,0.05,Inf), right_bracket = F, color = "#A4AA83")

logit_predictors<-df_subset[,-c(1:3,6,11)]#Preparing the binned predictors for the logistic model by removing the DebtDimId and the target "Bad" from the dataset
logit_target<-df_subset[,11] #Setting the target variable aside from the dataset

set.seed(551) #Randomly spliiting the dataset into 40% test data and 60% train data
trainingRows<-caret::createDataPartition(logit_target,p=.6,list=F)
trainX<-logit_predictors[trainingRows,]
trainY<-logit_target[trainingRows]
testX<-logit_predictors[-trainingRows,]
testY<-logit_target[-trainingRows]

logit_Fit<-glm(trainY ~.,data=trainX,family = "binomial")  #Logistic model with the function glm performed on the train data
summary(logit_Fit)#summary statistics from the logistic model

## 
## Call:
## glm(formula = trainY ~ ., family = "binomial", data = trainX)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0168  -0.4416  -0.3690  -0.2972   2.8452  
## 
## Coefficients:
##                                  Estimate Std. Error z value
## (Intercept)                      -3.32230    0.23956 -13.868
## RegionNorthEast                   0.01563    0.10225   0.153
## RegionSouthEast                   0.17986    0.10242   1.756
## RegionSouthWest                   0.07452    0.12448   0.599
## RegionWest                        0.27003    0.09890   2.730
## PublicRec[0.05,Inf)               0.24733    0.10920   2.265
## MonthlyIncome[3.2e+03,4.2e+03)   -0.24593    0.08696  -2.828
## MonthlyIncome[4.2e+03,5.5e+03)   -0.34352    0.08724  -3.938
## MonthlyIncome[5.5e+03,7.6e+03)   -0.53836    0.09033  -5.960
## MonthlyIncome[7.6e+03, Inf)      -0.98129    0.10370  -9.463
## RevolvingLineUtilization[25,42)  -0.08388    0.11891  -0.705
## RevolvingLineUtilization[42,55)   0.14620    0.11591   1.261
## RevolvingLineUtilization[55,65)   0.23035    0.11916   1.933
## RevolvingLineUtilization[65,75)   0.32329    0.11623   2.781
## RevolvingLineUtilization[75,87)   0.46605    0.11058   4.215
## RevolvingLineUtilization[87,Inf)  0.57645    0.10849   5.313
## Inquiries6M[1,2)                  0.31922    0.07292   4.378
## Inquiries6M[2,Inf)                0.61238    0.07249   8.448
## LoanPurposecredit_card            0.09380    0.23406   0.401
## LoanPurposedebt_consolidation     0.53804    0.21504   2.502
## LoanPurposeeducational            1.43166    0.42890   3.338
## LoanPurposehome_improvement       0.55735    0.24253   2.298
## LoanPurposehouse                  0.64113    0.35989   1.781
## LoanPurposemajor_purchase         0.54771    0.25221   2.172
## LoanPurposemedical                0.94108    0.29444   3.196
## LoanPurposemissing                0.79411    0.22794   3.484
## LoanPurposemoving                 0.63631    0.31707   2.007
## LoanPurposerenewable_energy       1.33166    0.49570   2.686
## LoanPurposesmall_business         1.47323    0.23417   6.291
## LoanPurposevacation               0.79828    0.35463   2.251
## LoanPurposewedding                0.30707    0.31306   0.981
##                                              Pr(>|z|)    
## (Intercept)                      < 0.0000000000000002 ***
## RegionNorthEast                              0.878523    
## RegionSouthEast                              0.079073 .  
## RegionSouthWest                              0.549394    
## RegionWest                                   0.006328 ** 
## PublicRec[0.05,Inf)                          0.023520 *  
## MonthlyIncome[3.2e+03,4.2e+03)               0.004685 ** 
## MonthlyIncome[4.2e+03,5.5e+03)         0.000082265828 ***
## MonthlyIncome[5.5e+03,7.6e+03)         0.000000002524 ***
## MonthlyIncome[7.6e+03, Inf)      < 0.0000000000000002 ***
## RevolvingLineUtilization[25,42)              0.480573    
## RevolvingLineUtilization[42,55)              0.207203    
## RevolvingLineUtilization[55,65)              0.053214 .  
## RevolvingLineUtilization[65,75)              0.005413 ** 
## RevolvingLineUtilization[75,87)        0.000025009424 ***
## RevolvingLineUtilization[87,Inf)       0.000000107660 ***
## Inquiries6M[1,2)                       0.000012003777 ***
## Inquiries6M[2,Inf)               < 0.0000000000000002 ***
## LoanPurposecredit_card                       0.688595    
## LoanPurposedebt_consolidation                0.012346 *  
## LoanPurposeeducational                       0.000844 ***
## LoanPurposehome_improvement                  0.021556 *  
## LoanPurposehouse                             0.074833 .  
## LoanPurposemajor_purchase                    0.029881 *  
## LoanPurposemedical                           0.001392 ** 
## LoanPurposemissing                           0.000494 ***
## LoanPurposemoving                            0.044765 *  
## LoanPurposerenewable_energy                  0.007222 ** 
## LoanPurposesmall_business              0.000000000315 ***
## LoanPurposevacation                          0.024385 *  
## LoanPurposewedding                           0.326662    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8817.3  on 15937  degrees of freedom
## Residual deviance: 8474.1  on 15907  degrees of freedom
## AIC: 8536.1
## 
## Number of Fisher Scoring iterations: 5

#NoYear appears insignificant at 0.5 sig. level, hence removed from the data

The summary statistics of the logistic regression model shows that, at 5% significant level, borrowers or customers from the West region stands a low chance of defaulting on their loan payment whereas those from NorthEast, SouthEast and SouthWest have a high chance of defaulting. Increase in monthly income of borrowers decreases the probability that they will default on their loan payment. It is also observed that increase in revolving line utilization of borrowers decrease the probability that a borrower will default. Customers that borrows loans for small business, renewable energy, moving purpose, medical expenses, home improvement, education and debt consolidation usually have low probability of defaulting on their loan payment.

logitTrain_pred<-predict(logit_Fit,trainX,type = "response")#validating the model with the train data
df_logitTrain_pred<-data.frame(obs=trainY,
                               pred=logitTrain_pred)
caret::defaultSummary(df_logitTrain_pred)#The Root Means Square Error (RMSE) and Mean Absolute Error associated with the model

##       RMSE   Rsquared        MAE 
## 0.26689210 0.02236349 0.14236016

logit_pred<-predict(logit_Fit,testX,type = "response")#validating the model with the test data
logit_values<-data.frame(obs=testY,
                         pred=logit_pred)#The Root Means Square Error (RMSE) and Mean Absolute Error associated with the model
caret::defaultSummary(logit_values)

##       RMSE   Rsquared        MAE 
## 0.26051395 0.01636385 0.13855451

The logistic regression model was built on the train data using the glm () function and the model performance evaluated on both the train and test data. The performance of the model on both the train and test data appears statistically significant as observed by its low RMSE value on both the train (0.26689210) and test data (0.26051395). This was also true when their MAE (0.14236016 and 0.13855451 for the train and test data, respectively) were computed. ROC curves for both the train and test data together with the area under each curve (auc) were computed. As shown in figure 23 the model’s performance rate on both the train and test data appears statistically significant as the value of the Area under their ROC curve (0.65 and 0.63 for the train and test data, respectively) is close to 1. Their computed AUROC curves are comparable, indicating the model do not overfit the test data.

df_LogisticTrain_pred<-tibble(trainY,logitTrain_pred)#Generating the scoring data for the logistic model on the train data
sort_df_LogisticTrain_pred<-(df_LogisticTrain_pred[order(df_LogisticTrain_pred[,2],decreasing = T),])
#DT::datatable(df_marsTrain_pred,filter = "top",caption="Table.2: Predicted verses Actual values in the train Data (Logistic Model)")

logit_results<-tibble(testY,logit_pred)#Generating the scoring data for the logistic model
sort_logit_pred<-(logit_results[order(logit_results[,2],decreasing = T),])
#DT::datatable(logit_results,filter = "top",caption="Table.1: Predicted verses Actual values in the test data (Logistic Model)")
#write.csv(sort_logit_pred,"Logistic_Binned.csv")

if(require('ROCR')){logitTrain_prediction<-prediction(sort_df_LogisticTrain_pred$logitTrain_pred,sort_df_LogisticTrain_pred$trainY)}
if(require('ROCR')){logitTrain_perf<-performance(logitTrain_prediction,"tpr","fpr")}

if(require(ROCR)){logit_prediction<-prediction(sort_logit_pred$logit_pred,sort_logit_pred$testY)}
if(require('ROCR')){logit_perf<-performance(logit_prediction,"tpr","fpr")}

logitTrain_auc<-performance(logitTrain_prediction, measure = "auc")   #Computes the area under the ROC curve     
print(paste(logitTrain_auc@y.values,"LOGISTIC TRAIN AUC"))

## [1] "0.650555279784403 LOGISTIC TRAIN AUC"

logitTest_auc<-performance(logit_prediction, measure = "auc")   #Computes the area under the ROC curve     
print(paste(logitTest_auc@y.values,"LOGISTIC TEST AUC"))

## [1] "0.626715324168999 LOGISTIC TEST AUC"

plot(unlist(slot(logitTrain_perf,"x.values")),unlist(slot(logitTrain_perf,"y.values")),
     ylab="True Positive Rate",xlab="False Positive Rate",type="l",col="green",main="Receiver Operating Characteristic (ROC) Plot on Logistic Model")

points(unlist(slot(logit_perf,"x.values")),unlist(slot(logit_perf,"y.values")),
       ylab="True Positive Rate",xlab="False Positive Rate",type="l",col="blue")

abline(a=0,b=1,col="red")
legend("topleft", legend=c("Train_AUC","Test_AUC","Baseline_AUC"),
       col=c("green","blue", "red"), lty=1:3, cex=0.6)

legend("bottomright", legend=c("LogitTrain_AUC=0.65","LogitTest_AUC=0.63"),
       col=c("green","blue"), lty=1:2, cex=0.6)

Figure 23: Receiver Operating Characteristic (ROC) plot on the Logistic model

if(require(gains)){logit_Gains<-gains(testY,logit_pred)}#Computing gains for the model prediction on the test data
logit_Gains

## Depth                            Cume   Cume Pct                     Mean
##  of           Cume     Mean      Mean   of Total    Lift   Cume     Model
## File     N      N      Resp      Resp      Resp    Index   Lift     Score
## -------------------------------------------------------------------------
##   10  1064   1064      0.15      0.15      20.4%     203    203      0.17
##   20  1060   2124      0.11      0.13      34.5%     142    173      0.12
##   30  1072   3196      0.10      0.12      48.5%     139    161      0.10
##   40  1054   4250      0.06      0.11      56.8%      83    142      0.08
##   50  1062   5312      0.06      0.10      64.9%      81    130      0.07
##   60  1062   6374      0.07      0.09      74.5%      96    124      0.07
##   70  1065   7439      0.07      0.09      83.3%      88    119      0.06
##   80  1060   8499      0.05      0.08      89.6%      63    112      0.05
##   90  1064   9563      0.04      0.08      95.1%      54    106      0.04
##  100  1061  10624      0.04      0.07     100.0%      49    100      0.03

From figure ?? and the gains table above, it is observed that depth of file of 50 which has cumulative observation of 5312 out of the 10624 has significant cumulative percentage of 64.9% out of the total response. This implies that in case of predicting customers with the probability of having bad loans or defaulting, if we target the first 50% of the customers (5312) from the predictions made by the model, 64.9% percent of them will default on their loan payment.

par(mfrow=c(1,3))#A plot of the Mean Response and the Deciles 
plot(logit_Gains,ylim = c(0,0.1),xlim=c(0,100))

barplot(logit_Gains$mean.resp, names.arg = logit_Gains$depth, xlab = "Percentile", #A barplot of the Means Response and the Deciles
        ylab = "Mean Response", main = "Decile-wise lift chart")
barplot(logit_Gains$cume.pct.of.total, names.arg = logit_Gains$depth, xlab = "Percentile", #A barplot of the cummulative percentage of the total response vs the deciles
        ylab = "Cum(%_Total_Response)", main = "Decile-wise lift chart")#A barplot of the cummulative percentage of the total response vs the deciles

Figure 24: Receiver Operating Characteristic (ROC) plot on the Logistic model

COMPARISON BEWTEEN MARS AND LOGISTIC MODEL

Area Under the ROC curve (AUC) for the logistic model was higher compared to the MARS model. However, the MARS model performs better on the gains table. Kolmogorov-Smirnov (KS) curves measure model fit. KS curves consist of two lines—one for TPR and one for FPR. The y-values are the rates at a percentile and the percentiles are along the x-axis. The further apart the two lines are, the better. This indicates that the model was able to separate the “1s” from the “0s”. The KS statistic is the largest difference between the TPR and FPR at a given percentile. A perfect model would be 1 and random selection would be 0. From figure 26, the logistic model performs better compared to the MARS model.

plot(unlist(slot(testperf, 'x.values')),unlist(slot(testperf,'y.values')),ylab="True Positive Rate",
     xlab="False Positive Rate",type="l",col="blue",main="AUROC for Logistics and MARS Model")
points(unlist(slot(logit_perf,"x.values")),unlist(slot(logit_perf,"y.values")),type="l",col="red")
abline(a=0,b=1,col="brown")
legend("topleft", legend=c("MARS Model", "Logistic Model","Random Model"),
       col=c("blue","red","brown"), lty=1:3, cex=0.6)
legend("bottomright", legend=c("MARS Model=0.62", "Logistic Model=0.63"),
       col=c("blue", "red"), lty=1:2, cex=0.6) #Comparing the the predictive power of MARS Model and the Logistic

Figure 25: AUROC for Logistics and MARS Model

                                              #Model by computing their AUROC curves

#KS-Plot on MARS Prediction
par(mfrow=c(1,2))
KS_mars<-as.data.frame(cbind(testperf@x.values[[1]], testperf@y.values[[1]]))
Percentile <- NULL
Difference <- NULL 
for (i in 1:nrow(KS_mars)){
  KS_mars[i, 3] = i/nrow(KS_mars)
  KS_mars[i, 4]= abs(KS_mars[i,2]-KS_mars[i,1])
}
colnames(KS_mars) <- c("FPR", "TPR", "Percentile", "Difference")
plot(KS_mars$Percentile, KS_mars$TPR, type = "l", col = "blue", 
     main = "KS Chart for MARS Prediction", 
     ylab = "TPR/FPR", 
     xlab = "Percentile")
points(KS_mars$Percentile, KS_mars$FPR, type = "l", col = "red")
abline(0,1, lty =2)
abline(v=0.49, lty = 2)

#KS-Plot on Logistic Prediction
KS_logit<- as.data.frame(cbind(logit_perf@x.values[[1]], logit_perf@y.values[[1]]))
Percentile <- NULL
Difference <- NULL 
for (i in 1:nrow(KS_logit)){
  KS_logit[i, 3] = i/nrow(KS_logit)
  KS_logit[i, 4]= abs(KS_logit[i,2]-KS_logit[i,1])
}
colnames(KS_logit) <- c("FPR", "TPR", "Percentile", "Difference")
plot(KS_logit$Percentile, KS_logit$TPR, type = "l", col = "blue", 
     main = "KS Chart for Logistic Prediction", 
     ylab = "TPR/FPR", 
     xlab = "Percentile")
points(KS_logit$Percentile, KS_logit$FPR, type = "l", col = "red")
abline(0,1, lty =2)
abline(v=0.3649, lty = 2)

Figure 26: KS-Charts on the MARS and Logistic model

CONCLUSION

The exploratory analysis shows that credit score is the ultimate flag to determine if a customer requesting for a load will default on the loan payment or not. However, this predictor variable was not selected by either of the models as significant in the prediction process. From the analysis, majority of good loan customers have good credit score compared to bad loan customers and few have very good credit score which is significantly higher than the bad loan customers. Customers grouped by home ownership shows that most borrowers that rent or are on mortgage do not default on their loan payment. Among borrowers who own their own homes, most of them do not default on their loan payment. Number of months for payments of a loan do not show whether a customer will default or not on their loan payment. Debt Consolidation is the most common purpose of loans. But the proportion is different among different types of home ownership. It was observed that the amount of loans requested in high standard of living states such as California, Florida, New York, Texas and Maryland are significantly high compared to states with low standard of living.

The model performance of the MARS and logistic models are comparable in terms of predictive power. The ROC curves and AUC values showed that neither model suffered from overfitting, however, the logistic model appeared better in its prediction. While the MARS model had better values for the gains table, its KS statistic was not comparable to the logistic model. Hence, the logistic model will better performance in predicting new customers with the probability of defaulting on their loan payment. Factors that were common among the two models in predicting Bad loan include monthly income, revolving line utilization, number of inquires a customer had made for new credit in the last 6 months and the purpose of the loan. There is therefor the need for investors to look out for borrowers with high monthly income, little to no credit inquires and have low revolving credit line utilization. Purpose for a particular loan is another important factor which should be considered especially with borrowers requesting money wedding.

PREDICTING LOAN DEFAULTS WITH PREDICTIVE MODELS

Samuel Adjei

5/14/2021