DATA SUMMARY
The dataset used in this analysis is from a website www.lendingclub.com. It contains data on real loans by people. It has 30 input variables and 26,562 observations. The variables describe the credit characteristics and loan application information from loan applicants. It also includes a response variable called ‘target’ which has two levels, 1 for bad and 0 for a good loan. Bad is defined as being 30 or more days past due 13 months into the loan. This is used as a ‘surrogate’ for the loan defaulting
library(ggplot2)
library(tidyverse)
setwd("D:/R-STUDIO/STAT 515/Final_Project") #Reading the data into R
df_loan<-read.csv(file="LendingClub2021.csv",na.strings = c("","NA"))
dplyr::glimpse(df_loan)
## Rows: 26,562
## Columns: 31
## $ LOAN_ID <int> 290807, 347648, 348340, 357509, 362601, 36359~
## $ MOB <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ InterestRate <dbl> 15.70, 12.73, 11.12, 10.25, 13.23, 10.38, 9.8~
## $ IssuedDate <chr> "2/22/2010", "2/19/2010", "7/28/2010", "3/30/~
## $ DTI <dbl> 20.44, 20.64, 19.72, 11.62, 22.79, 11.57, 11.~
## $ State <chr> "NY", "CA", "NY", "CA", "KY", "GA", "NY", "MO~
## $ HomeOwnership <chr> "RENT", "RENT", "RENT", "RENT", "RENT", "MORT~
## $ MonthlyIncome <dbl> 7083.00, 5000.00, 5916.67, 5000.00, 3750.00, ~
## $ EarliestCREDITLine <chr> "1/30/1995", "3/4/2005", "1/13/2001", "11/12/~
## $ OpenCREDITLines <int> 16, 8, 11, 13, 11, 16, 6, 13, 3, 11, 4, 4, 18~
## $ TotalCREDITLines <int> 32, 18, 18, 22, 26, 24, 24, 39, 7, 22, 7, 17,~
## $ RevolvingCREDITBalance <int> 28511, 3353, 20736, 5419, 5738, 10516, 7875, ~
## $ RevolvingLineUtilization <dbl> 56.7, 54.1, 53.1, 31.1, 43.8, 19.5, 25.8, 65.~
## $ Inquiries6M <int> 1, 0, 0, 2, 1, 2, 3, 0, 0, 1, 0, 3, 3, 0, 0, ~
## $ AccountsDQ <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ DelinquentAmount <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ DQ2yrs <int> 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, ~
## $ MonthsSinceDQ <int> 55, 999, 999, 999, 2, 999, 61, 22, 999, 44, 1~
## $ PublicRec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~
## $ MonthsSinceLastRec <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ Education <chr> "Marist College,Xavier High School", NA, NA, ~
## $ EmploymentLength <chr> "< 1 year", "1 year", "8 years", "1 year", "2~
## $ currentpolicy <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ term <int> 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 3~
## $ appl_fico_band <chr> "670-674", "710-714", "710-714", "715-719", "~
## $ vintage <chr> "10Q1", "10Q1", "10Q3", "10Q1", "10Q3", "10Q3~
## $ TERM <int> 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 3~
## $ MFMonth <chr> "0010-03-01", "0010-03-01", "0010-09-01", "00~
## $ target <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Amount.Requested <int> 15000, 10000, 8000, 7500, 4000, 3000, 10000, ~
## $ LoanPurpose <chr> "debt_consolidation", "house", "credit_card",~
DATA PRE-PROCESSING
The input variable “LoanPurpose” had missing observations. These observations were replaced with the string “missing”. Also, the variable “MonthsSinceDQ” has missing observations recorded as 999. This value serves as flag for missing observations, hence, were also replaced with the string “missing”. After computing the percentage of missing observations in all the variables, it was observed that the variable “MonthsSinceLastRec” and “Education” have 97.2% (25820 observations) and 98.8% (26254 observations) percent respectively, missing observations. These two variables were removed from the dataset. Other variables considered as info variables such as vintage, MFMonth, Amount Requested, TERM, term and interest rate which were removed from the dataset before the model building process. These variables appear not having significant contribution in the prediction. Some of the variables with string values were re-code. Among these include state, and EmploymentLength. Also, appl_fico_band which had ranged observations was re-code. The difference between IssuedDate and EarliestCREDITLine was computed as NoYear representing the number of years a person has been loaned since his/her first known credit line. IssuedDate and EarliestCREDITLine were also removed from the data after their difference was computed. HomeOwwnership has 4 levels with one level (OTHER) having only two values. This presented a little issue during the model building process, hence, the level was RENT which is the third level of the variable. The final dataset after the processing stage had 26,562 rows with 15 columns.
SmartEDA::ExpData(df_loan,2)#Computing percentage of missing values in each variable in the data
## Index Variable_Name Variable_Type Sample_n Missing_Count
## 1 1 LOAN_ID integer 26562 0
## 2 2 MOB integer 26562 0
## 3 3 InterestRate numeric 26562 0
## 4 4 IssuedDate character 26562 0
## 5 5 DTI numeric 26562 0
## 6 6 State character 26562 0
## 7 7 HomeOwnership character 26562 0
## 8 8 MonthlyIncome numeric 26562 0
## 9 9 EarliestCREDITLine character 26562 0
## 10 10 OpenCREDITLines integer 26562 0
## 11 11 TotalCREDITLines integer 26562 0
## 12 12 RevolvingCREDITBalance integer 26562 0
## 13 13 RevolvingLineUtilization numeric 26562 0
## 14 14 Inquiries6M integer 26562 0
## 15 15 AccountsDQ integer 26562 0
## 16 16 DelinquentAmount integer 26562 0
## 17 17 DQ2yrs integer 26562 0
## 18 18 MonthsSinceDQ integer 26562 0
## 19 19 PublicRec integer 26562 0
## 20 20 MonthsSinceLastRec integer 742 25820
## 21 21 Education character 308 26254
## 22 22 EmploymentLength character 26562 0
## 23 23 currentpolicy integer 26562 0
## 24 24 term integer 26562 0
## 25 25 appl_fico_band character 26562 0
## 26 26 vintage character 26562 0
## 27 27 TERM integer 26562 0
## 28 28 MFMonth character 26562 0
## 29 29 target integer 26562 0
## 30 30 Amount.Requested integer 26562 0
## 31 31 LoanPurpose character 24120 2442
## Per_of_Missing No_of_distinct_values
## 1 0.000 26562
## 2 0.000 1
## 3 0.000 201
## 4 0.000 527
## 5 0.000 2750
## 6 0.000 44
## 7 0.000 4
## 8 0.000 3835
## 9 0.000 8546
## 10 0.000 37
## 11 0.000 78
## 12 0.000 17440
## 13 0.000 1067
## 14 0.000 9
## 15 0.000 1
## 16 0.000 1
## 17 0.000 11
## 18 0.000 94
## 19 0.000 5
## 20 0.972 85
## 21 0.988 278
## 22 0.000 12
## 23 0.000 1
## 24 0.000 2
## 25 0.000 34
## 26 0.000 9
## 27 0.000 2
## 28 0.000 27
## 29 0.000 2
## 30 0.000 838
## 31 0.092 13
df_loan$MonthsSinceDQ<-ifelse(df_loan$MonthsSinceDQ==999,"missing",df_loan$MonthsSinceDQ)#Replacing 999 in MonthsSinceDQ with missing
df_loan$LoanPurpose<-ifelse(is.na(df_loan$LoanPurpose),'missing',df_loan$LoanPurpose) #Replacing NA's in LoanPurpose with missing
df_loan$NoYear<-as.numeric(format(as.Date(df_loan$IssuedDate,
format = "%m/%d/%Y"),"%Y"))-as.numeric(format(as.Date(df_loan$EarliestCREDITLine,
format = "%m/%d/%Y"),"%Y")) #Computing the difference between IssuedDate and EarliestCREDITLine
#and creating a new column "NoYear" to contain the difference
df_loan<-df_loan[,-c(4,9,20:21,24,26,28)] #Dropping term (same as TERM),Education, MonthsSinceLastRec and other info variables such as MFMonth, vintage, and IssuedDate, EarliestCREDITLine from the data
#df_loan[,c(5:6,16,18:20,26)]<-lapply(df_loan[,c(5:6,16,18:20,26)], as.factor)
EXPLORATORY ANALYSIS
numeric_vars<-df_loan%>% #Selecting only the numeric variables in the data
select(MonthlyIncome,Amount.Requested,MOB,InterestRate,DTI,OpenCREDITLines,TotalCREDITLines,RevolvingCREDITBalance,
RevolvingLineUtilization,Inquiries6M,AccountsDQ,DelinquentAmount,DQ2yrs,PublicRec,currentpolicy,TERM,NoYear)
summary(numeric_vars)#Summary statistics on the numeric variables
## MonthlyIncome Amount.Requested MOB InterestRate DTI
## Min. : 340 Min. : 1000 Min. :1 Min. : 6.00 Min. : 0.00
## 1st Qu.: 3489 1st Qu.: 6000 1st Qu.:1 1st Qu.:11.12 1st Qu.: 9.13
## Median : 5000 Median :10612 Median :1 Median :13.48 Median :14.24
## Mean : 5841 Mean :12578 Mean :1 Mean :13.80 Mean :13.99
## 3rd Qu.: 7000 3rd Qu.:17500 3rd Qu.:1 3rd Qu.:15.96 3rd Qu.:19.16
## Max. :500000 Max. :35000 Max. :1 Max. :24.59 Max. :29.95
## OpenCREDITLines TotalCREDITLines RevolvingCREDITBalance
## Min. : 2.000 Min. : 3.00 Min. : 0
## 1st Qu.: 6.000 1st Qu.:13.00 1st Qu.: 4447
## Median : 9.000 Median :20.00 Median : 9878
## Mean : 9.275 Mean :21.84 Mean : 14033
## 3rd Qu.:12.000 3rd Qu.:29.00 3rd Qu.: 18052
## Max. :39.000 Max. :90.00 Max. :149588
## RevolvingLineUtilization Inquiries6M AccountsDQ DelinquentAmount
## Min. : 0.00 Min. :0.0000 Min. :0 Min. :0
## 1st Qu.:37.20 1st Qu.:0.0000 1st Qu.:0 1st Qu.:0
## Median :59.60 Median :1.0000 Median :0 Median :0
## Mean :56.67 Mean :0.9136 Mean :0 Mean :0
## 3rd Qu.:78.90 3rd Qu.:2.0000 3rd Qu.:0 3rd Qu.:0
## Max. :99.90 Max. :8.0000 Max. :0 Max. :0
## DQ2yrs PublicRec currentpolicy TERM
## Min. : 0.0000 Min. :0.00000 Min. :1 Min. :36.00
## 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:1 1st Qu.:36.00
## Median : 0.0000 Median :0.00000 Median :1 Median :36.00
## Mean : 0.1769 Mean :0.06649 Mean :1 Mean :45.74
## 3rd Qu.: 0.0000 3rd Qu.:0.00000 3rd Qu.:1 3rd Qu.:60.00
## Max. :11.0000 Max. :4.00000 Max. :1 Max. :60.00
## NoYear
## Min. : 3.00
## 1st Qu.: 9.00
## Median :12.00
## Mean :13.56
## 3rd Qu.:17.00
## Max. :65.00
The summary statistics on the numeric variables shows that no customer has a delinquent amount which correlates with the account delinquent as there is none for the customers. It appears there is zero variance in MOB and current policy. It appears all the customers have been only a month long since the records were taken. It also appears the same credit policy is applied to all the customers. Some of the customers earn as low as 340 dollars and has high as 500,000 dollars monthly. However, the monthly income of most of the customers is 5841 dollars. The lowest amount of loan requested is 1000 dollars and the highest is 35,000 dollars. Interest rate can go as low as 6% and as high as 24.59%.
numeric_vars<-numeric_vars[,-c(3,11:12,15:16)]#Dropping MOB, AccountDQ, DelinquentAmount and currentpolicy
plotHist <- function(columns,bin,colours){
par(mfrow = c(3,4))#Histogram plots to visualize the distribution of the numeric variables in the data set.
for (i in columns) {
hist(numeric_vars[,i], main = paste("Histogram of ", names(numeric_vars)[i]),
nclass = bin, las = 1, col = colours,
xlab = paste(names(numeric_vars)[i]))
}
}
plotHist(c(1:12), rep(5,12), "turquoise2")
Figure 1: Histogram on the numeric variables
From figure 1, the distribution of the variable Amount.Requested, DTI and InterestRate, TotalCREDITLines and RevolvinggLineUtilization appears to be normally distributed. Most of the variables are not normally distributed. About 17,587 (66%) of the customers have 10 or less open credit lines whereas about 723 (2.7%) have 20 more open credit lines. The number of months for payment of a particular loan among the customers is usually 36 or 60 months with 60 months being the most common.
df_loan %>%
select(MonthlyIncome,Amount.Requested,InterestRate,DTI,OpenCREDITLines,
TotalCREDITLines,RevolvingCREDITBalance,RevolvingLineUtilization,
Inquiries6M,TERM,NoYear,target) %>%
mutate(target = recode_factor(target,
`0` = "Good",
`1` = "Bad" )) %>%
gather(key = "key",
value = "value",-target) %>%
#Distribution of the target variable (0=Good,1=Bad) among the numeric variables
#Since no customer is indebted or has any standing delinquent amount, this variable together with AccountsDQ were removed from the data
ggplot(aes(y = value)) +
geom_boxplot(aes(fill = target),
alpha = .6,
fatten = .7) +
labs(x = "",
y = "",
title = "Boxplots of Numeric Variables") +
scale_fill_manual(
values = c("turquoise2","turquoise4"),
name = "Target\nBad",
labels = c("Good", "Bad")) +
theme_bw() +
facet_wrap(~ key,
scales = "free",
ncol = 4)
Figure 2: Boxplot of numeric variables
In the figure above, it is observed that the proportion of the amount of loan requested by the two category of customers (Bad =“1” and Good =“0”) is equal. However, interest rate varies among these two groups, with that of Bad loan customers (customers likely to default) being higher than of Good loan customers (customers with low chance of defaulting). Revolving credit balance divided by total revolving credit limit (RevolvingLineUtilization) varies among Bad and Good loan customers with that of bad loan customers being slightly high. Number of months of loan payment for a loan do not affect whether a customer will default or not as the distribution is equal among the two groups of customers.
Bad<-df_loan[,-c(2,13:14,19)] %>% #Dropping MOB, AccountDQ, DelinquentAmount and currentpolicy
filter(target==1)
Good<-df_loan[,-c(2,13:14,19)] %>%
filter(target==0)
counts<-function(x,y,z){
n=nrow(x)
B_acc=nrow(y)
G_acc=nrow(Good)
P_B_acc=(B_acc/n)*100
P_G_acc=((G_acc)/n)*100
return(list(num_bad_account=B_acc,num_good_account=G_acc,
percent_bad_account=P_B_acc,percent_good_account=P_G_acc))
}
counts(df_loan,Bad,Good)
## $num_bad_account
## [1] 2052
##
## $num_good_account
## [1] 24510
##
## $percent_bad_account
## [1] 7.725322
##
## $percent_good_account
## [1] 92.27468
The total number of Bad loan and Good loan recorded is 2052 and 24510 representing approximately 7.7 and 92.3, respectively.
Bad %>% #Computing the summary statistics of amount requested, monthly income and the number of years a customer has been loaned among the bad accounts
select(Amount.Requested,MonthlyIncome,InterestRate,NoYear) %>%
summary()
## Amount.Requested MonthlyIncome InterestRate NoYear
## Min. : 1000 Min. : 500 Min. : 6.00 Min. : 3.00
## 1st Qu.: 5844 1st Qu.: 3082 1st Qu.:12.42 1st Qu.: 9.00
## Median :10800 Median : 4400 Median :14.65 Median :12.00
## Mean :12607 Mean : 5106 Mean :14.91 Mean :13.06
## 3rd Qu.:18000 3rd Qu.: 6118 3rd Qu.:17.08 3rd Qu.:16.00
## Max. :35000 Max. :70000 Max. :24.40 Max. :48.00
Good %>% #Computing the summary statistics of amount requested, monthly income and the number of years a customer has been loaned among the good accounts
select(Amount.Requested,MonthlyIncome,InterestRate,NoYear) %>%
summary()
## Amount.Requested MonthlyIncome InterestRate NoYear
## Min. : 1000 Min. : 340 Min. : 6.00 Min. : 3.0
## 1st Qu.: 6000 1st Qu.: 3500 1st Qu.:11.12 1st Qu.: 9.0
## Median :10600 Median : 5000 Median :13.35 Median :12.0
## Mean :12575 Mean : 5903 Mean :13.71 Mean :13.6
## 3rd Qu.:17500 3rd Qu.: 7083 3rd Qu.:15.70 3rd Qu.:17.0
## Max. :35000 Max. :500000 Max. :24.59 Max. :65.0
After computing summary statistics of amount of money requested, monthly income, interest rate and number of years on both Bad and Good loan customers, it appears the minimum and maximum amount of money requested by both groups is approximately equal with most of the amount requested centered around 12,500 dollar (fig. 5 & 6). The minimum monthly income observed among Bad loan customers is 500 dollars with 70,000 dollars being the maximum. This differs when compared to the Good loan customers with minimum of 340 dollars and maximum amount of 500,000 dollars. A plot with the frequency of customers with monthly income greater or equal to 20,000 dollars: figure 9 shows that among the two groups, majority of the Good loan customers have monthly income greater than 20,000 with most of them earning 25,000 monthly. Most of the Good customers have monthly income of approximately 5,900 which exceeds that of the Bad customers with approximately 800 dollars. Interest rate on loans requested appears to be approximately equal among the two groups with that of Bad loan customers being slightly high. It appears Good loan customers have handled money with the LendingClub longer than the Bad loan customers. The number of years customers in both group have been employed at their current jobs appears to be equal (fig. 3 & 4).
#Bar plot on Employment Length
ggplot(Bad, aes(x =EmploymentLength)) +
geom_bar(color = "white", fill = "#A4AA83", binwidth = .5) +
theme_minimal() +
labs(title = "Barplot for Employment Length on Bad Loan Customers", x = "Employment Length",
y = "Frequency")
Figure 3: Bar plot on Employment Length
ggplot(Good, aes(x =EmploymentLength)) +
geom_bar(color = "white", fill = "#A4AA83", binwidth = .5) +
theme_minimal() +
labs(title = "Barplot for Employment Length on Good Loan Customers", x = "Employment Length",
y = "Frequency")
Figure 4: Bar plot on Employment Length
#Histogram on Amount of Money Requested
ggplot(Bad, aes(x =Amount.Requested)) +
geom_histogram(color = "white", fill = "#A4AA83") +
theme_minimal() +
labs(title = "Histogram for Amount Requested by Bad Loan Customers", x = "Amount of Money Requested",
y = "Frequency") + xlim(0,35000)
Figure 5: Histogram of amount of money requested by group
ggplot(Good, aes(x =Amount.Requested)) +
geom_histogram(color = "white", fill = "#A4AA83") +
theme_minimal() +
labs(title = "Histogram for Amount Requested by Good Loan Customers", x = "Amount of Money Requested",
y = "Frequency") + xlim(0,35000)
Figure 6: Histogram of amount of money requested by group
#Histogram on Amount of Monthly Income
ggplot(Bad, aes(x = MonthlyIncome)) +
geom_histogram(color = "white", fill = "#A4AA83") +
theme_minimal() +
labs(title = "Histogram for Monthly Income on Bad Loan Customers", x = "Monthly Income",
y = "Frequency") + xlim(0,20000)
Figure 7: Histogram on monthly income by group
ggplot(Good, aes(x = MonthlyIncome)) +
geom_histogram(color = "white", fill = "#A4AA83") +
theme_minimal() +
labs(title = "Histogram for Monthly Income on Good Loan Customers", x = "Monthly Income",
y = "Frequency") +
xlim(0, 20000)
Figure 8: Histogram on monthly income by group
A<-Bad %>%
filter(MonthlyIncome >=20000) %>%
ggplot(aes(MonthlyIncome)) + geom_histogram(color="white",fill="#A4AA83") +
theme_minimal() + labs(title = "Monthly Income >= 20000 on Bad Account",
x="Monthly Income >= 20000",y="Frequency") + xlim(20000,70000)
B<-Good %>%
filter(MonthlyIncome >=20000) %>%
ggplot(aes(MonthlyIncome)) + geom_histogram(color="white",fill="#A4AA83") +
theme_minimal() + labs(title = "Monthly Income >= 20000 on Good Account",
x="Monthly Income >= 20000",y="Frequency") + xlim(20000,70000)
gridExtra::grid.arrange(A,B,ncol=2)
Figure 9: Monthly income >= 20000 by group
ggplot(df_loan, aes(y = HomeOwnership,
LoanPurpose)) + geom_count(color = "brown") + theme(axis.text.x = element_text(angle = 90,
hjust = 1))
Figure 10: A plot of Number of Loans by HomeOwnership and Loan Purpose, and Interest Rate vs Loan Purpose
ggplot(df_loan, aes(y = InterestRate,
LoanPurpose,fill=LoanPurpose)) + geom_violin(trim = FALSE) +
#A plot showing distribution of Amount of loan requested for the different levels in LoanPurpose
geom_boxplot(width = 0.1) + theme(axis.text.x = element_text(angle = 90,
hjust = 1))
Figure 11: A plot of Number of Loans by HomeOwnership and Loan Purpose, and Interest Rate vs Loan Purpose
Figure 11 shows the number of loans grouped by Home Ownership and the purpose of the loan. Debt Consolidation is the most common purpose of loans. But the proportion is different among different types of home ownership. Most mortgage and rent owners have the highest number of loans for Debt Consolidation and credit cards. Individuals with own homes do not really apply for either of the loan purpose given. Interest rate on loans requested for cars, credit card, debt consolidation and education are significantly higher compared to the other loan purpose with that of cars being highest. However, the interest rate of loans on cars and education are significantly low compared to small business which is not often requested.
ggplot(df_loan, aes(x =LoanPurpose, y = Amount.Requested, fill = LoanPurpose)) +
geom_violin(trim = FALSE) + #A plot showing distribution of Amount of loan requested for the different levels in LoanPurpose
geom_boxplot(width = 0.1)
Figure 12: A plot of Amount of Loan Requested grouped by Loan Purpose
sta.income <-df_loan%>% mutate(Fico_Score=ifelse(appl_fico_band<740,"Good",
ifelse(appl_fico_band>800,"Excellent","Very Good")),
IncomeType = factor(ifelse(MonthlyIncome < 4000, "Under4000",
ifelse(MonthlyIncome < 6000 & MonthlyIncome >= 4000, "4000-6000",
ifelse(MonthlyIncome < 8000 & MonthlyIncome >= 6000, "6000-8000",
ifelse(MonthlyIncome < 10000 & MonthlyIncome >= 8000, "8000-10000",
"Above10000"))))),
)
ggplot(sta.income, aes(Fico_Score,InterestRate,fill=Fico_Score))+geom_boxplot(outlier.color = "blue")+
theme()+labs(title="Box plot of Interest rate vs Fico_Score")
Figure 13: Box plot of Interest rate vs Fico_Score
Figure 13 shows that as credit score of a customer increases, the interest rate on a loan requested by the customer becomes deceases which makes sense as these loans would most likely be less risky. This is also observed in figure 14 . Interest rate charged on amount of loan requested by a customer is largely dependent on the credit score of that customer. Customers with excellent credit score requesting significantly high amount of loans turns to have low interest rate compared to customers with good credit score but requesting low amount. In respective of the income type, customers with excellent credit score requests significantly high amount of loans compared to other groups: 15 .There is no significant association between amount of loan requested and interest rate: figure 16 .
ggplot(sta.income, aes(Amount.Requested,InterestRate,fill=Fico_Score))+geom_boxplot(outlier.color = "blue")+
theme()+labs(title="Box plot of Amount of Loan Requested vs Interest Rate")
Figure 14: Box plot of Amount of Loan Requested vs Interest Rate
ggplot(sta.income, aes(IncomeType,Amount.Requested,fill=Fico_Score))+geom_boxplot(outlier.color = "blue")+
theme()+labs(title="Box plot of Amount of Loan Requested vs IncomeType")
Figure 15: Box plot of Amount of Loan Requested vs IncomeType
ggplot(sta.income, aes(Amount.Requested, InterestRate)) + geom_point() + geom_smooth() + xlab("Amount of Loan Requested") +
ylab("InterestRate")+labs(title="Scatterplot of Interest Rate Vs. Inquiries")
Figure 16: Scatterplot of Interest Rate Vs. Inquiries
ggplot(sta.income, aes(x = State, y = Amount.Requested)) +
geom_point(size=3, color="turquoise4") +
geom_segment(aes(x = State, xend = State, y = 0, yend = Amount.Requested)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+labs(title="Loppipop plot of Amount of Loan Requested in each State")
Figure 17: Loppipop plot of Amount of Loan Requested in each State
Figure 17 shows the amount of money requested in each state. It is observed that the amount of money requested in California, Florida, New York, Texas and Maryland are significantly high compared to other states. Amount of money requested is significantly low in Mississippi which is evident by their low population and low standard of living. States with low standard of living such as South Dakota, Arkansas, Vermont, among others request low amount of money.
df_loan<-df_loan %>%
mutate(State=ifelse(State %in% c("CT","MA","MD","NH","NJ","NY","PA","RI","VT"),"NorthEast",
ifelse(State %in% c("MS","AL","GA","LA","AR","FL","SC","NC","KY","VA","WV","DE","DC"),"SouthEast",
ifelse(State %in% c("CA","WA","OR","MT","WY","NV","UT","CO","HI","AK"),"West",
ifelse(State %in% c("SD","KS","MN","MO","IL","OH","MI","WI"),"MidWest","SouthWest")))),
EmploymentLength=recode_factor(EmploymentLength,
`< 1 year` = "1",`1 year` = "1", `2 years` = "2",`3 years` = "3",
`4 years` = "4",`5 years` = "5",`6 years` = "6",`6 years` = "6",
`7 years` = "7",`8 years` = "8",`9 years` = "9",`n/a` = "0",
`10+ years` = "10" ),
appl_fico_band=ifelse(appl_fico_band<740,"Good",
ifelse(appl_fico_band>800,"Excellent","Very Good"))
) %>%
rename(
Region=State
)
factor_vars<-df_loan%>% #Generating the distribution of the target (0=Good, 1=Bad) among the categorical variables
select(HomeOwnership,appl_fico_band,TERM,Region,target) %>%
mutate(target = recode_factor(target,
`0` = "Good",
`1` = "Bad" ),
TERM=recode_factor(TERM,
`36` = "Short Term",
`60` = "Long Term" )) %>%
gather(key = "key", value = "value",-target)
factor_vars%>%
ggplot(aes(value)) +
geom_bar(aes(x = value,
fill = target),
alpha = .6,
position = "dodge",
backgroundColor="white",
color = "black",
width = .8
) +
labs(x = "",
y = "",
title = "Barplot of Categorical Variables vs Target (Bad or Good)") +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
facet_wrap(~ key, scales = "free", nrow = 3) +
scale_fill_manual(
values = c("turquoise4", "turquoise2"),
name = "Target\nBad",
labels = c("Good", "Bad"))
Figure 18: Barplot of Categorical variables vs Target (Bad or Good)
In figure 18, only few of both good and bad loan customers have excellent credit score. Majority of the good customers have good credit score compared to the bad customers and few have very good credit score which is significantly higher than the bad customers. Grouping by home ownership shows that most of the customers that rent or are on mortgage do not default in their loan payment. Among customers who own their own home, most of them do not default in their loan payment. Number of months payments for a loan do not show whether a customer will default or not in their loan payment. Among both long and shot payment, the number of good customers is significantly high compared to bad customers. Even though figure 19 shows the amount of money requested by customers in the 5 regions are approximately equal, figure 18 shows that the proportion of good customers to bad customers taking into account the total number of borrowers from each of the region is approximately
ggplot(df_loan, aes(y = Amount.Requested,
Region,fill=Region)) + geom_violin(trim = FALSE) +
geom_boxplot(width = 0.1) + theme(axis.text.x = element_text(angle = 90,
hjust = 1))
Figure 19: A plot showing distribution of Amount of Money Requested in Regions
df_loan$EmploymentLength<-as.numeric(df_loan$EmploymentLength)
df_loan<-df_loan[,-c(2:3,13:14,19,21,23)]
df<-df_loan[,c(2,5:11,13:14,18)]
corr<-round(cor(df),1)
ggcorrplot::ggcorrplot(corr, hc.order = TRUE, outline.col = "white")
Figure 20: Correlation matrix of the numeric variables
# using hierarchical clustering for computing correlation among the numeric predictors.
df_loan<-df_loan[-6]#Removing number of opening credit lines from the data since it's highly correlated with nuumber of total credit lines
df_loan[, c(3:4,14)]<-lapply(df_loan[,c(3:4,14)],as.factor)#Converting the re-coded variables into their original forms
levels(df_loan$HomeOwnership)[2]<-"RENT" #There are only 2 values for the level OTHER, hence, converted it to RENT
MODEL BUILDING
Building Multivariate Adaptive Regression Splines (MARS) Model
Feature selection:
The nature of MARS features breaks predictor variables into two distinct groups and models linear relationships between the predictor and the outcome in each group. Due to its robustness in terms of handling input variables, data used for MARS models needs little pre-processing such that transformation and filtering of input variables are not necessary. Even though discrepancies among input variables such as their strong correlation with one another, do not significantly affect the model’s performance, they can complicate the model’s interpretation. To circumvent this shortcoming, correlation matrix was computed among the numeric predictors. The correlation between TotalCREDITLines and OpenCREDITLines was a value greater than 0.7: figure 20. Hence, TotalCREDITLines was removed from the data. Total of 6 out of 17 predictors that entered the model were retained. This is because the model thinned the predictors and retain only those significant for the prediction.
Data Splitting:
The MARS model was built with the dataset containing the 18 input variables and the target variable, target. The data set was split randomly at 60% (15938 observations) as the train data and 40% (10624) as the test data. The MARS model was built on the train data using the earth package, and the model performance evaluated on both the train and test data.
set.seed(551)
target<-df_loan$target
predictors<-df_loan[,-c(1,15)]
trainRows<-caret::createDataPartition(target,p=.60,list = FALSE) #Randomly spliiting the data into 40% test data and 60% train data
trainPredictors<-predictors[trainRows,]
trainTarget<-target[trainRows]
testPredictors<-predictors[-trainRows,]
testTarget<-target[-trainRows]
marsfit<-earth::earth(trainPredictors,trainTarget,glm=list(family=binomial),degree=1)#Performing the MARS model using the earth package.
summary(marsfit) #Using the summary() to generate more extensive output from the model
## Call: earth(x=trainPredictors, y=trainTarget, glm=list(family=binomial),
## degree=1)
##
## GLM coefficients
## trainTarget
## (Intercept) -3.2583129
## MonthsSinceDQ13 1.0955130
## LoanPurposecredit_card -0.4607224
## LoanPurposesmall_business 0.9032795
## h(9750-MonthlyIncome) 0.0001490
## h(15.2-RevolvingLineUtilization) 0.0518046
## h(RevolvingLineUtilization-15.2) 0.0124722
## h(3-Inquiries6M) -0.2563617
##
## GLM (family binomial, link logit):
## nulldev df dev df devratio AIC iters converged
## 8817.3 15937 8465.82 15930 0.0399 8482 5 1
##
## Earth selected 8 of 10 terms, and 6 of 119 predictors
## Termination condition: RSq changed by less than 0.001 at 10 terms
## Importance: MonthlyIncome, LoanPurposesmall_business, Inquiries6M, ...
## Number of terms at each degree of interaction: 1 7 (additive model)
## Earth GCV 0.07134819 RSS 1135.008 GRSq 0.02086246 RSq 0.02258197
Summary statistics of MARS model can be difficult to explain. A negative coefficient does not mean a negative association between the target and the predictor. Customers with 13 or more months of delinquency since the last record was taken have higher chance of defaulting on their loan. Also, customers that request loan for credit card purpose stands low chance of defaulting. However, customers with a request for their small business have a high chance of defaulting. customers with monthly income less than 9750 dollars have low chance of defaulting. In terms of revolving line utilization, customers with more than 15.2 revolving lines have low chance of defaulting compared to customers with less than 15.2 revolving lines. If a customer has made less than 3 inquiries for new credit in the last 6 months, there is significantly low chance of the customer defaulting.
kableExtra::kable(caret::varImp(marsfit),caption = "Table.1: Earth variable importance") %>%
kableExtra::kable_classic(html_font="Times New Roman")#Predictors selected by the MARS model
| Overall | |
|---|---|
| MonthlyIncome | 100.00000 |
| LoanPurposesmall_business | 83.14410 |
| Inquiries6M | 69.49103 |
| RevolvingLineUtilization | 53.89279 |
| LoanPurposecredit_card | 28.84886 |
| MonthsSinceDQ13 | 19.16641 |
Table 1 shows the predictors selected by the earth function during the model building as significant in the prediction. Their overall importance is scaled between 0 and 100.
marsTest_pred<-predict(marsfit,newdata=testPredictors,type="response")#Validating the model on the test data
testTarget_testPred<-data.frame(obs=testTarget,
pred=marsTest_pred)
testTarget_testPred %>%
rename(
pred=trainTarget
) %>%
caret::defaultSummary()
## RMSE Rsquared MAE
## 0.26080554 0.01458389 0.13838347
marsTrain_pred<-predict(marsfit,newdata=trainPredictors,type="response")#Validating the model on the train data
trainTarget_trainPred<-data.frame(obs=trainTarget,
pred=marsTrain_pred)
trainTarget_trainPred %>%
rename(
pred=trainTarget,
) %>%
caret::defaultSummary()
## RMSE Rsquared MAE
## 0.26674862 0.02340963 0.14221774
The performance of the MARS model on both the train and test data was statistically significant as observed by its low RMSE value on both the train (0.26674862) and test data (0.26080554). This was also true when their MAE (0.14221774 and 0.13838347 for the train and test data, respectively) were computed. Even though its low Mean Absolute Error rate do not tell the model’s performance, it suggests that the MARS model is great at prediction. ROC curves for both the train and test data together with the area under each curve (auc) were computed. As shown in figure 21, the model’s performance rate on both the train and test data appears statistically significant as the value of the Area under the ROC curve (0.65 and 0.62 for the train and test data, respectively) is close to 1. Their computed AUROC curves are comparable, indicating the model do not overfit the test data.
df_marsTrain_pred<-tibble(trainTarget,marsTrain_pred)#Generating the scoring data for the MARS model on the train data
sort_df_marsTrain_pred<-(df_marsTrain_pred[order(df_marsTrain_pred[,2],decreasing = T),])
#DT::datatable(df_marsTrain_pred,filter = "top",caption="Table.2: Predicted verses Actual values in the train Data (MARS Model)")
df_marsTest_pred<-tibble(testTarget,marsTest_pred)#Generating the scoring data for the logistic model on the test data
sort_mars_Test_pred<-(df_marsTest_pred[order(df_marsTest_pred[,2],decreasing = T),])
#DT::datatable(sort_mars_target_pred,filter = "top",caption="Table.3: Predicted verses Actual values in the test Data (Logistic Model)")
#write.csv(sort_mars_target_pred,"MARS_Scoring.csv")
if(require(ROCR)){trainpred<-prediction(sort_df_marsTrain_pred$marsTrain_pred,sort_df_marsTrain_pred$trainTarget)}
if(require('ROCR')){trainperf<-performance(trainpred,"tpr","fpr")} #Generate a plot of True Positive Rate vs False Positive Rate from the logistic model
marsTrain_auc<-performance(trainpred, measure = "auc") #Computes the area under the ROC curve
print(paste(marsTrain_auc@y.values,"AUC ON TRAIN DATA"))
## [1] "0.651595090410237 AUC ON TRAIN DATA"
if(require(ROCR)){testpred<-prediction(sort_mars_Test_pred$marsTest_pred,sort_mars_Test_pred$testTarget)}
if(require('ROCR')){testperf<-performance(testpred,"tpr","fpr")} #Generate a plot of True Positive Rate vs False Positive Rate from the logistic model
mars_test_auc<-performance(testpred, measure = "auc") #Computes the area under the ROC curve
print(paste(mars_test_auc@y.values,"AUC ON TEST DATA"))
## [1] "0.618229296508351 AUC ON TEST DATA"
plot(unlist(slot(trainperf,"x.values")),unlist(slot(trainperf,"y.values")),ylab="True Positive Rate",xlab="False Positive Rate",type="l",col="green",
main="Receiver Operating Characteristic (ROC) Plot on the MARS Model")
points(unlist(slot(testperf,"x.values")),unlist(slot(testperf,"y.values")),ylab="True Positive Rate",xlab="False Positive Rate",type="l",col="blue")
abline(a=0,b=1,col="red")
legend("topleft", legend=c("TRAIN_AUC","Test_AUC","Baseline_AUC"),
col=c("green","blue","red"), lty=1:3, cex=0.6)
legend("bottomright", legend=c("Train_AUC=0.65","Test_AUC=0.62"),
col=c("green","blue"), lty=1:2, cex=0.6)
Figure 21: Receiver Operating Characteristic (ROC) plot on the MARS model
if(require(gains)){Gains<-gains(testTarget,marsTest_pred)}#Computing gains for the model prediction on the test data
Gains
## Depth Cume Cume Pct Mean
## of Cume Mean Mean of Total Lift Cume Model
## File N N Resp Resp Resp Index Lift Score
## -------------------------------------------------------------------------
## 10 1062 1062 0.14 0.14 19.3% 193 193 0.17
## 20 1062 2124 0.10 0.12 32.7% 134 164 0.12
## 30 1063 3187 0.10 0.11 45.8% 130 153 0.10
## 40 1062 4249 0.07 0.10 55.6% 99 139 0.08
## 50 1063 5312 0.07 0.10 65.5% 99 131 0.07
## 60 1062 6374 0.06 0.09 73.1% 76 122 0.07
## 70 1062 7436 0.06 0.09 81.0% 80 116 0.06
## 80 1063 8499 0.06 0.08 88.6% 76 111 0.05
## 90 1062 9561 0.04 0.08 94.1% 54 105 0.04
## 100 1063 10624 0.04 0.07 100.0% 59 100 0.03
From figure 22 and the gains table above, it is observed that depth of file of 50 which has cumulative observation of 5312 out of the 10624 has significant cumulative percentage of 65.5% out of the total response. This implies that in case of predicting customers with the probability of having bad loans or defaulting, if we target the first 50% of the customers (5312) from the predictions made by the model, 65.5% percent of them will default on their loan payment. Thus, approximately 3,480 out of the 5312 customers will default on their loans.
par(mfrow=c(1,3))#A plot of the Mean Response and the Deciles
plot(Gains,ylim = c(0,0.1),xlim=c(0,100))
barplot(Gains$mean.resp, names.arg = Gains$depth, xlab = "Percentile", #A barplot of the Means Response and the Deciles
ylab = "Mean Response", main = "Decile-wise lift chart")
barplot(Gains$cume.pct.of.total, names.arg = Gains$depth, xlab = "Percentile", #A barplot of the cummulative percentage of the total response vs the deciles
ylab = "Cum(%_Total_Response)", main = "Decile-wise lift chart")#A barplot of the cummulative percentage of the total response vs the deciles
Figure 22: Gains plot and barplot of Mean Response, and Cumulative Percent of Total Response on Depth of File from the MARS Model
Building Logistic Model
step_model=lm(target ~ .,data=df_loan[,-1])
Stepwise_model=step(step_model,direction="both",test="F")#Stepwise selection was performed on the variables to select those that appear significant at 0.5 significant level
## Start: AIC=-70496.17
## target ~ DTI + Region + HomeOwnership + MonthlyIncome + TotalCREDITLines +
## RevolvingCREDITBalance + RevolvingLineUtilization + Inquiries6M +
## DQ2yrs + MonthsSinceDQ + PublicRec + EmploymentLength + appl_fico_band +
## LoanPurpose + NoYear
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - MonthsSinceDQ 93 6.4525 1858.0 -70590 0.9906 0.506091
## - TotalCREDITLines 1 0.0046 1851.6 -70498 0.0657 0.797662
## - RevolvingCREDITBalance 1 0.0219 1851.6 -70498 0.3132 0.575719
## - EmploymentLength 1 0.0447 1851.6 -70498 0.6381 0.424413
## - DQ2yrs 1 0.0736 1851.7 -70497 1.0511 0.305256
## - DTI 1 0.0888 1851.7 -70497 1.2676 0.260222
## - appl_fico_band 2 0.2614 1851.8 -70496 1.8661 0.154739
## <none> 1851.6 -70496
## - NoYear 1 0.1713 1851.8 -70496 2.4458 0.117852
## - Region 4 1.3598 1853.0 -70485 4.8539 0.000653 ***
## - PublicRec 1 1.1066 1852.7 -70482 15.8006 7.057e-05 ***
## - HomeOwnership 2 1.5197 1853.1 -70478 10.8489 1.951e-05 ***
## - MonthlyIncome 1 1.7061 1853.3 -70474 24.3602 8.039e-07 ***
## - RevolvingLineUtilization 1 2.7237 1854.3 -70459 38.8890 4.554e-10 ***
## - Inquiries6M 1 8.3918 1860.0 -70378 119.8184 < 2.2e-16 ***
## - LoanPurpose 13 15.6703 1867.3 -70298 17.2108 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=-70589.77
## target ~ DTI + Region + HomeOwnership + MonthlyIncome + TotalCREDITLines +
## RevolvingCREDITBalance + RevolvingLineUtilization + Inquiries6M +
## DQ2yrs + PublicRec + EmploymentLength + appl_fico_band +
## LoanPurpose + NoYear
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - TotalCREDITLines 1 0.0208 1858.1 -70591 0.2968 0.5859086
## - EmploymentLength 1 0.0315 1858.1 -70591 0.4500 0.5023201
## - RevolvingCREDITBalance 1 0.0434 1858.1 -70591 0.6198 0.4311121
## - appl_fico_band 2 0.2097 1858.2 -70591 1.4973 0.2237446
## - DTI 1 0.1366 1858.2 -70590 1.9506 0.1625306
## <none> 1858.0 -70590
## - NoYear 1 0.2359 1858.3 -70588 3.3679 0.0664888 .
## - DQ2yrs 1 0.8047 1858.8 -70580 11.4898 0.0007008 ***
## - Region 4 1.3622 1859.4 -70578 4.8625 0.0006429 ***
## - PublicRec 1 1.1381 1859.2 -70576 16.2502 5.566e-05 ***
## - HomeOwnership 2 1.4749 1859.5 -70573 10.5299 2.684e-05 ***
## - MonthlyIncome 1 1.6284 1859.7 -70569 23.2503 1.430e-06 ***
## - RevolvingLineUtilization 1 2.8170 1860.9 -70552 40.2228 2.302e-10 ***
## + MonthsSinceDQ 93 6.4525 1851.6 -70496 0.9906 0.5060910
## - Inquiries6M 1 8.6131 1866.7 -70469 122.9822 < 2.2e-16 ***
## - LoanPurpose 13 15.6729 1873.7 -70393 17.2142 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=-70591.47
## target ~ DTI + Region + HomeOwnership + MonthlyIncome + RevolvingCREDITBalance +
## RevolvingLineUtilization + Inquiries6M + DQ2yrs + PublicRec +
## EmploymentLength + appl_fico_band + LoanPurpose + NoYear
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - EmploymentLength 1 0.0299 1858.1 -70593 0.4271 0.5134082
## - RevolvingCREDITBalance 1 0.0352 1858.1 -70593 0.5026 0.4783493
## - appl_fico_band 2 0.2103 1858.3 -70592 1.5013 0.2228486
## - DTI 1 0.1183 1858.2 -70592 1.6892 0.1937227
## <none> 1858.1 -70591
## + TotalCREDITLines 1 0.0208 1858.0 -70590 0.2968 0.5859086
## - NoYear 1 0.2929 1858.4 -70589 4.1826 0.0408502 *
## - DQ2yrs 1 0.7892 1858.9 -70582 11.2694 0.0007891 ***
## - Region 4 1.3798 1859.5 -70580 4.9254 0.0005736 ***
## - PublicRec 1 1.1449 1859.2 -70577 16.3478 5.286e-05 ***
## - HomeOwnership 2 1.5686 1859.6 -70573 11.1988 1.375e-05 ***
## - MonthlyIncome 1 1.7260 1859.8 -70569 24.6457 6.933e-07 ***
## - RevolvingLineUtilization 1 2.9887 1861.0 -70551 42.6756 6.578e-11 ***
## + MonthsSinceDQ 93 6.4687 1851.6 -70498 0.9932 0.4992160
## - Inquiries6M 1 8.6255 1866.7 -70470 123.1626 < 2.2e-16 ***
## - LoanPurpose 13 15.7240 1873.8 -70394 17.2708 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=-70593.05
## target ~ DTI + Region + HomeOwnership + MonthlyIncome + RevolvingCREDITBalance +
## RevolvingLineUtilization + Inquiries6M + DQ2yrs + PublicRec +
## appl_fico_band + LoanPurpose + NoYear
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - RevolvingCREDITBalance 1 0.0375 1858.1 -70595 0.5358 0.4641817
## - appl_fico_band 2 0.2091 1858.3 -70594 1.4928 0.2247554
## - DTI 1 0.1209 1858.2 -70593 1.7257 0.1889764
## <none> 1858.1 -70593
## + EmploymentLength 1 0.0299 1858.1 -70591 0.4271 0.5134082
## + TotalCREDITLines 1 0.0192 1858.1 -70591 0.2739 0.6007562
## - NoYear 1 0.2654 1858.4 -70591 3.7895 0.0515871 .
## - DQ2yrs 1 0.7893 1858.9 -70584 11.2712 0.0007883 ***
## - Region 4 1.3820 1859.5 -70581 4.9335 0.0005653 ***
## - PublicRec 1 1.1739 1859.3 -70578 16.7623 4.249e-05 ***
## - HomeOwnership 2 1.5414 1859.6 -70575 11.0049 1.670e-05 ***
## - MonthlyIncome 1 1.7222 1859.8 -70570 24.5922 7.128e-07 ***
## - RevolvingLineUtilization 1 2.9947 1861.1 -70552 42.7620 6.294e-11 ***
## + MonthsSinceDQ 93 6.4548 1851.6 -70499 0.9910 0.5049997
## - Inquiries6M 1 8.6280 1866.7 -70472 123.2007 < 2.2e-16 ***
## - LoanPurpose 13 15.7306 1873.8 -70395 17.2784 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=-70594.51
## target ~ DTI + Region + HomeOwnership + MonthlyIncome + RevolvingLineUtilization +
## Inquiries6M + DQ2yrs + PublicRec + appl_fico_band + LoanPurpose +
## NoYear
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - appl_fico_band 2 0.2007 1858.3 -70596 1.4331 0.2385937
## <none> 1858.1 -70595
## - DTI 1 0.1558 1858.3 -70594 2.2241 0.1358835
## - NoYear 1 0.2346 1858.4 -70593 3.3504 0.0672011 .
## + RevolvingCREDITBalance 1 0.0375 1858.1 -70593 0.5358 0.4641817
## + EmploymentLength 1 0.0322 1858.1 -70593 0.4603 0.4974932
## + TotalCREDITLines 1 0.0111 1858.1 -70593 0.1584 0.6906446
## - DQ2yrs 1 0.7701 1858.9 -70586 10.9971 0.0009138 ***
## - Region 4 1.3810 1859.5 -70583 4.9300 0.0005688 ***
## - PublicRec 1 1.1429 1859.3 -70580 16.3193 5.367e-05 ***
## - HomeOwnership 2 1.5051 1859.6 -70577 10.7460 2.163e-05 ***
## - MonthlyIncome 1 1.7076 1859.8 -70572 24.3840 7.940e-07 ***
## - RevolvingLineUtilization 1 3.3391 1861.5 -70549 47.6800 5.131e-12 ***
## + MonthsSinceDQ 93 6.4711 1851.7 -70501 0.9936 0.4981121
## - Inquiries6M 1 8.5980 1866.7 -70474 122.7741 < 2.2e-16 ***
## - LoanPurpose 13 15.7025 1873.8 -70397 17.2478 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=-70595.64
## target ~ DTI + Region + HomeOwnership + MonthlyIncome + RevolvingLineUtilization +
## Inquiries6M + DQ2yrs + PublicRec + LoanPurpose + NoYear
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1858.3 -70596
## - DTI 1 0.1560 1858.5 -70595 2.2269 0.1356385
## + appl_fico_band 2 0.2007 1858.1 -70595 1.4331 0.2385937
## + EmploymentLength 1 0.0308 1858.3 -70594 0.4393 0.5074753
## + RevolvingCREDITBalance 1 0.0292 1858.3 -70594 0.4162 0.5188299
## + TotalCREDITLines 1 0.0124 1858.3 -70594 0.1764 0.6745051
## - NoYear 1 0.2761 1858.6 -70594 3.9425 0.0470902 *
## - DQ2yrs 1 0.8710 1859.2 -70585 12.4363 0.0004218 ***
## - Region 4 1.3960 1859.7 -70584 4.9835 0.0005162 ***
## - PublicRec 1 1.2161 1859.5 -70580 17.3646 3.095e-05 ***
## - HomeOwnership 2 1.5449 1859.9 -70578 11.0297 1.629e-05 ***
## - MonthlyIncome 1 1.7210 1860.1 -70573 24.5743 7.194e-07 ***
## - RevolvingLineUtilization 1 4.1234 1862.5 -70539 58.8772 1.736e-14 ***
## + MonthsSinceDQ 93 6.4157 1851.9 -70502 0.9850 0.5214999
## - Inquiries6M 1 8.5317 1866.9 -70476 121.8230 < 2.2e-16 ***
## - LoanPurpose 13 15.6809 1874.0 -70398 17.2236 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Feature selection:
Stepwise selection was used to select significant predictors for the logistic regression. Out of the 17 predictors, 9 were selected as significant for the prediction at 0.05 significant level. However, after modeling the logistic regression with the 9 predictors, the summary statistics from the model showed that NoYear, DQ2yrs and HomeOwnership were insignificant in the prediction. These predictors were removed and the model built with the remaining 6 predictors. The numeric variables were binned with BinProfet () function in Rprofet package and smoothed or pruned with WOE_custom () function also part of Rprofet package.
df_subset<-select(df_loan,LOAN_ID,NoYear,DQ2yrs,Region,PublicRec,HomeOwnership,MonthlyIncome,RevolvingLineUtilization,Inquiries6M,LoanPurpose,target)
bin_vars<-Rprofet::BinProfet(dat=df_subset,'LOAN_ID',varcol = c(2:3,5,7,8,9),target = 'target')#Bucketing the predictors with the function BinProfet before model building
#Plots of binned predictors
bin_plots<-function(x,y){
Rprofet::WOEplotter(dat=bin_vars,var = x,target=y)
}
bin_plots('MonthlyIncome_Bins','target')#Generating plots for the binned predictors to check if there is the need to prune any before they enter the model
bin_plots('NoYear_Bins','target')
bin_plots('RevolvingLineUtilization_Bins','target')
bin_plots('DQ2yrs_Bins','target')
bin_plots('Inquiries6M_Bins','target')
bin_plots('PublicRec_Bins','target')
#Plots of the smoothed/pruned numeric predictors
df_subset$DQ2yrs<-Rprofet::WOE_custom(df_subset,'DQ2yrs', target='target', breaks=c(0,0.2,0.5,2.5,Inf), right_bracket = F, color = "#A4AA83")
df_subset$RevolvingLineUtilization<-Rprofet::WOE_custom(df_subset,'RevolvingLineUtilization', target='target', breaks=c(0,25,42,55,65,75,87,Inf), right_bracket = F, color = "#A4AA83")
df_subset$Inquiries6M<-Rprofet::WOE_custom(df_subset,'Inquiries6M', target='target', breaks=c(0,0.02,1,2,Inf), right_bracket = F, color = "#A4AA83")
df_subset$MonthlyIncome<-Rprofet::WOE_custom(df_subset,'MonthlyIncome', target='target', breaks=c(-Inf,3.2e+03,4.2e+03,5.5e+03,7.6e+03,Inf), right_bracket = F, color ="#A4AA83")
df_subset$NoYear<-Rprofet::WOE_custom(df_subset,'NoYear', target='target', breaks=c(-Inf,7,10,12,15,20,Inf), right_bracket = F, color = "#A4AA83")
df_subset$PublicRec<-Rprofet::WOE_custom(df_subset,'PublicRec', target='target', breaks=c(0,0.001,0.05,Inf), right_bracket = F, color = "#A4AA83")
logit_predictors<-df_subset[,-c(1:3,6,11)]#Preparing the binned predictors for the logistic model by removing the DebtDimId and the target "Bad" from the dataset
logit_target<-df_subset[,11] #Setting the target variable aside from the dataset
set.seed(551) #Randomly spliiting the dataset into 40% test data and 60% train data
trainingRows<-caret::createDataPartition(logit_target,p=.6,list=F)
trainX<-logit_predictors[trainingRows,]
trainY<-logit_target[trainingRows]
testX<-logit_predictors[-trainingRows,]
testY<-logit_target[-trainingRows]
logit_Fit<-glm(trainY ~.,data=trainX,family = "binomial") #Logistic model with the function glm performed on the train data
summary(logit_Fit)#summary statistics from the logistic model
##
## Call:
## glm(formula = trainY ~ ., family = "binomial", data = trainX)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0168 -0.4416 -0.3690 -0.2972 2.8452
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -3.32230 0.23956 -13.868
## RegionNorthEast 0.01563 0.10225 0.153
## RegionSouthEast 0.17986 0.10242 1.756
## RegionSouthWest 0.07452 0.12448 0.599
## RegionWest 0.27003 0.09890 2.730
## PublicRec[0.05,Inf) 0.24733 0.10920 2.265
## MonthlyIncome[3.2e+03,4.2e+03) -0.24593 0.08696 -2.828
## MonthlyIncome[4.2e+03,5.5e+03) -0.34352 0.08724 -3.938
## MonthlyIncome[5.5e+03,7.6e+03) -0.53836 0.09033 -5.960
## MonthlyIncome[7.6e+03, Inf) -0.98129 0.10370 -9.463
## RevolvingLineUtilization[25,42) -0.08388 0.11891 -0.705
## RevolvingLineUtilization[42,55) 0.14620 0.11591 1.261
## RevolvingLineUtilization[55,65) 0.23035 0.11916 1.933
## RevolvingLineUtilization[65,75) 0.32329 0.11623 2.781
## RevolvingLineUtilization[75,87) 0.46605 0.11058 4.215
## RevolvingLineUtilization[87,Inf) 0.57645 0.10849 5.313
## Inquiries6M[1,2) 0.31922 0.07292 4.378
## Inquiries6M[2,Inf) 0.61238 0.07249 8.448
## LoanPurposecredit_card 0.09380 0.23406 0.401
## LoanPurposedebt_consolidation 0.53804 0.21504 2.502
## LoanPurposeeducational 1.43166 0.42890 3.338
## LoanPurposehome_improvement 0.55735 0.24253 2.298
## LoanPurposehouse 0.64113 0.35989 1.781
## LoanPurposemajor_purchase 0.54771 0.25221 2.172
## LoanPurposemedical 0.94108 0.29444 3.196
## LoanPurposemissing 0.79411 0.22794 3.484
## LoanPurposemoving 0.63631 0.31707 2.007
## LoanPurposerenewable_energy 1.33166 0.49570 2.686
## LoanPurposesmall_business 1.47323 0.23417 6.291
## LoanPurposevacation 0.79828 0.35463 2.251
## LoanPurposewedding 0.30707 0.31306 0.981
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## RegionNorthEast 0.878523
## RegionSouthEast 0.079073 .
## RegionSouthWest 0.549394
## RegionWest 0.006328 **
## PublicRec[0.05,Inf) 0.023520 *
## MonthlyIncome[3.2e+03,4.2e+03) 0.004685 **
## MonthlyIncome[4.2e+03,5.5e+03) 0.000082265828 ***
## MonthlyIncome[5.5e+03,7.6e+03) 0.000000002524 ***
## MonthlyIncome[7.6e+03, Inf) < 0.0000000000000002 ***
## RevolvingLineUtilization[25,42) 0.480573
## RevolvingLineUtilization[42,55) 0.207203
## RevolvingLineUtilization[55,65) 0.053214 .
## RevolvingLineUtilization[65,75) 0.005413 **
## RevolvingLineUtilization[75,87) 0.000025009424 ***
## RevolvingLineUtilization[87,Inf) 0.000000107660 ***
## Inquiries6M[1,2) 0.000012003777 ***
## Inquiries6M[2,Inf) < 0.0000000000000002 ***
## LoanPurposecredit_card 0.688595
## LoanPurposedebt_consolidation 0.012346 *
## LoanPurposeeducational 0.000844 ***
## LoanPurposehome_improvement 0.021556 *
## LoanPurposehouse 0.074833 .
## LoanPurposemajor_purchase 0.029881 *
## LoanPurposemedical 0.001392 **
## LoanPurposemissing 0.000494 ***
## LoanPurposemoving 0.044765 *
## LoanPurposerenewable_energy 0.007222 **
## LoanPurposesmall_business 0.000000000315 ***
## LoanPurposevacation 0.024385 *
## LoanPurposewedding 0.326662
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8817.3 on 15937 degrees of freedom
## Residual deviance: 8474.1 on 15907 degrees of freedom
## AIC: 8536.1
##
## Number of Fisher Scoring iterations: 5
#NoYear appears insignificant at 0.5 sig. level, hence removed from the data
The summary statistics of the logistic regression model shows that, at 5% significant level, borrowers or customers from the West region stands a low chance of defaulting on their loan payment whereas those from NorthEast, SouthEast and SouthWest have a high chance of defaulting. Increase in monthly income of borrowers decreases the probability that they will default on their loan payment. It is also observed that increase in revolving line utilization of borrowers decrease the probability that a borrower will default. Customers that borrows loans for small business, renewable energy, moving purpose, medical expenses, home improvement, education and debt consolidation usually have low probability of defaulting on their loan payment.
logitTrain_pred<-predict(logit_Fit,trainX,type = "response")#validating the model with the train data
df_logitTrain_pred<-data.frame(obs=trainY,
pred=logitTrain_pred)
caret::defaultSummary(df_logitTrain_pred)#The Root Means Square Error (RMSE) and Mean Absolute Error associated with the model
## RMSE Rsquared MAE
## 0.26689210 0.02236349 0.14236016
logit_pred<-predict(logit_Fit,testX,type = "response")#validating the model with the test data
logit_values<-data.frame(obs=testY,
pred=logit_pred)#The Root Means Square Error (RMSE) and Mean Absolute Error associated with the model
caret::defaultSummary(logit_values)
## RMSE Rsquared MAE
## 0.26051395 0.01636385 0.13855451
The logistic regression model was built on the train data using the glm () function and the model performance evaluated on both the train and test data. The performance of the model on both the train and test data appears statistically significant as observed by its low RMSE value on both the train (0.26689210) and test data (0.26051395). This was also true when their MAE (0.14236016 and 0.13855451 for the train and test data, respectively) were computed. ROC curves for both the train and test data together with the area under each curve (auc) were computed. As shown in figure 23 the model’s performance rate on both the train and test data appears statistically significant as the value of the Area under their ROC curve (0.65 and 0.63 for the train and test data, respectively) is close to 1. Their computed AUROC curves are comparable, indicating the model do not overfit the test data.
df_LogisticTrain_pred<-tibble(trainY,logitTrain_pred)#Generating the scoring data for the logistic model on the train data
sort_df_LogisticTrain_pred<-(df_LogisticTrain_pred[order(df_LogisticTrain_pred[,2],decreasing = T),])
#DT::datatable(df_marsTrain_pred,filter = "top",caption="Table.2: Predicted verses Actual values in the train Data (Logistic Model)")
logit_results<-tibble(testY,logit_pred)#Generating the scoring data for the logistic model
sort_logit_pred<-(logit_results[order(logit_results[,2],decreasing = T),])
#DT::datatable(logit_results,filter = "top",caption="Table.1: Predicted verses Actual values in the test data (Logistic Model)")
#write.csv(sort_logit_pred,"Logistic_Binned.csv")
if(require('ROCR')){logitTrain_prediction<-prediction(sort_df_LogisticTrain_pred$logitTrain_pred,sort_df_LogisticTrain_pred$trainY)}
if(require('ROCR')){logitTrain_perf<-performance(logitTrain_prediction,"tpr","fpr")}
if(require(ROCR)){logit_prediction<-prediction(sort_logit_pred$logit_pred,sort_logit_pred$testY)}
if(require('ROCR')){logit_perf<-performance(logit_prediction,"tpr","fpr")}
logitTrain_auc<-performance(logitTrain_prediction, measure = "auc") #Computes the area under the ROC curve
print(paste(logitTrain_auc@y.values,"LOGISTIC TRAIN AUC"))
## [1] "0.650555279784403 LOGISTIC TRAIN AUC"
logitTest_auc<-performance(logit_prediction, measure = "auc") #Computes the area under the ROC curve
print(paste(logitTest_auc@y.values,"LOGISTIC TEST AUC"))
## [1] "0.626715324168999 LOGISTIC TEST AUC"
plot(unlist(slot(logitTrain_perf,"x.values")),unlist(slot(logitTrain_perf,"y.values")),
ylab="True Positive Rate",xlab="False Positive Rate",type="l",col="green",main="Receiver Operating Characteristic (ROC) Plot on Logistic Model")
points(unlist(slot(logit_perf,"x.values")),unlist(slot(logit_perf,"y.values")),
ylab="True Positive Rate",xlab="False Positive Rate",type="l",col="blue")
abline(a=0,b=1,col="red")
legend("topleft", legend=c("Train_AUC","Test_AUC","Baseline_AUC"),
col=c("green","blue", "red"), lty=1:3, cex=0.6)
legend("bottomright", legend=c("LogitTrain_AUC=0.65","LogitTest_AUC=0.63"),
col=c("green","blue"), lty=1:2, cex=0.6)
Figure 23: Receiver Operating Characteristic (ROC) plot on the Logistic model
if(require(gains)){logit_Gains<-gains(testY,logit_pred)}#Computing gains for the model prediction on the test data
logit_Gains
## Depth Cume Cume Pct Mean
## of Cume Mean Mean of Total Lift Cume Model
## File N N Resp Resp Resp Index Lift Score
## -------------------------------------------------------------------------
## 10 1064 1064 0.15 0.15 20.4% 203 203 0.17
## 20 1060 2124 0.11 0.13 34.5% 142 173 0.12
## 30 1072 3196 0.10 0.12 48.5% 139 161 0.10
## 40 1054 4250 0.06 0.11 56.8% 83 142 0.08
## 50 1062 5312 0.06 0.10 64.9% 81 130 0.07
## 60 1062 6374 0.07 0.09 74.5% 96 124 0.07
## 70 1065 7439 0.07 0.09 83.3% 88 119 0.06
## 80 1060 8499 0.05 0.08 89.6% 63 112 0.05
## 90 1064 9563 0.04 0.08 95.1% 54 106 0.04
## 100 1061 10624 0.04 0.07 100.0% 49 100 0.03
From figure ?? and the gains table above, it is observed that depth of file of 50 which has cumulative observation of 5312 out of the 10624 has significant cumulative percentage of 64.9% out of the total response. This implies that in case of predicting customers with the probability of having bad loans or defaulting, if we target the first 50% of the customers (5312) from the predictions made by the model, 64.9% percent of them will default on their loan payment.
par(mfrow=c(1,3))#A plot of the Mean Response and the Deciles
plot(logit_Gains,ylim = c(0,0.1),xlim=c(0,100))
barplot(logit_Gains$mean.resp, names.arg = logit_Gains$depth, xlab = "Percentile", #A barplot of the Means Response and the Deciles
ylab = "Mean Response", main = "Decile-wise lift chart")
barplot(logit_Gains$cume.pct.of.total, names.arg = logit_Gains$depth, xlab = "Percentile", #A barplot of the cummulative percentage of the total response vs the deciles
ylab = "Cum(%_Total_Response)", main = "Decile-wise lift chart")#A barplot of the cummulative percentage of the total response vs the deciles
Figure 24: Receiver Operating Characteristic (ROC) plot on the Logistic model
COMPARISON BEWTEEN MARS AND LOGISTIC MODEL
Area Under the ROC curve (AUC) for the logistic model was higher compared to the MARS model. However, the MARS model performs better on the gains table. Kolmogorov-Smirnov (KS) curves measure model fit. KS curves consist of two lines—one for TPR and one for FPR. The y-values are the rates at a percentile and the percentiles are along the x-axis. The further apart the two lines are, the better. This indicates that the model was able to separate the “1s” from the “0s”. The KS statistic is the largest difference between the TPR and FPR at a given percentile. A perfect model would be 1 and random selection would be 0. From figure 26, the logistic model performs better compared to the MARS model.
plot(unlist(slot(testperf, 'x.values')),unlist(slot(testperf,'y.values')),ylab="True Positive Rate",
xlab="False Positive Rate",type="l",col="blue",main="AUROC for Logistics and MARS Model")
points(unlist(slot(logit_perf,"x.values")),unlist(slot(logit_perf,"y.values")),type="l",col="red")
abline(a=0,b=1,col="brown")
legend("topleft", legend=c("MARS Model", "Logistic Model","Random Model"),
col=c("blue","red","brown"), lty=1:3, cex=0.6)
legend("bottomright", legend=c("MARS Model=0.62", "Logistic Model=0.63"),
col=c("blue", "red"), lty=1:2, cex=0.6) #Comparing the the predictive power of MARS Model and the Logistic
Figure 25: AUROC for Logistics and MARS Model
#Model by computing their AUROC curves
#KS-Plot on MARS Prediction
par(mfrow=c(1,2))
KS_mars<-as.data.frame(cbind(testperf@x.values[[1]], testperf@y.values[[1]]))
Percentile <- NULL
Difference <- NULL
for (i in 1:nrow(KS_mars)){
KS_mars[i, 3] = i/nrow(KS_mars)
KS_mars[i, 4]= abs(KS_mars[i,2]-KS_mars[i,1])
}
colnames(KS_mars) <- c("FPR", "TPR", "Percentile", "Difference")
plot(KS_mars$Percentile, KS_mars$TPR, type = "l", col = "blue",
main = "KS Chart for MARS Prediction",
ylab = "TPR/FPR",
xlab = "Percentile")
points(KS_mars$Percentile, KS_mars$FPR, type = "l", col = "red")
abline(0,1, lty =2)
abline(v=0.49, lty = 2)
#KS-Plot on Logistic Prediction
KS_logit<- as.data.frame(cbind(logit_perf@x.values[[1]], logit_perf@y.values[[1]]))
Percentile <- NULL
Difference <- NULL
for (i in 1:nrow(KS_logit)){
KS_logit[i, 3] = i/nrow(KS_logit)
KS_logit[i, 4]= abs(KS_logit[i,2]-KS_logit[i,1])
}
colnames(KS_logit) <- c("FPR", "TPR", "Percentile", "Difference")
plot(KS_logit$Percentile, KS_logit$TPR, type = "l", col = "blue",
main = "KS Chart for Logistic Prediction",
ylab = "TPR/FPR",
xlab = "Percentile")
points(KS_logit$Percentile, KS_logit$FPR, type = "l", col = "red")
abline(0,1, lty =2)
abline(v=0.3649, lty = 2)
Figure 26: KS-Charts on the MARS and Logistic model
CONCLUSION
The exploratory analysis shows that credit score is the ultimate flag to determine if a customer requesting for a load will default on the loan payment or not. However, this predictor variable was not selected by either of the models as significant in the prediction process. From the analysis, majority of good loan customers have good credit score compared to bad loan customers and few have very good credit score which is significantly higher than the bad loan customers. Customers grouped by home ownership shows that most borrowers that rent or are on mortgage do not default on their loan payment. Among borrowers who own their own homes, most of them do not default on their loan payment. Number of months for payments of a loan do not show whether a customer will default or not on their loan payment. Debt Consolidation is the most common purpose of loans. But the proportion is different among different types of home ownership. It was observed that the amount of loans requested in high standard of living states such as California, Florida, New York, Texas and Maryland are significantly high compared to states with low standard of living.
The model performance of the MARS and logistic models are comparable in terms of predictive power. The ROC curves and AUC values showed that neither model suffered from overfitting, however, the logistic model appeared better in its prediction. While the MARS model had better values for the gains table, its KS statistic was not comparable to the logistic model. Hence, the logistic model will better performance in predicting new customers with the probability of defaulting on their loan payment. Factors that were common among the two models in predicting Bad loan include monthly income, revolving line utilization, number of inquires a customer had made for new credit in the last 6 months and the purpose of the loan. There is therefor the need for investors to look out for borrowers with high monthly income, little to no credit inquires and have low revolving credit line utilization. Purpose for a particular loan is another important factor which should be considered especially with borrowers requesting money wedding.